CN112083956B - Heterogeneous platform-oriented automatic management system for complex pointer data structure - Google Patents

Heterogeneous platform-oriented automatic management system for complex pointer data structure

Info

Publication number
CN112083956B
CN112083956B (application CN202010971038.9A)
Authority
CN
China
Prior art keywords
information
serial
parallel
node
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010971038.9A
Other languages
Chinese (zh)
Other versions
CN112083956A (en)
Inventor
张伟哲
何慧
王法瑞
方滨兴
郝萌
郭浩男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010971038.9A
Publication of CN112083956A
Application granted
Publication of CN112083956B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/425 Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Stored Programmes (AREA)

Abstract

An automatic management system for complex pointer data structures oriented to heterogeneous platforms, relating to the technical field of heterogeneous programming. The invention aims to automatically manage complex pointer data structures in OpenMP Offloading programs on heterogeneous computing platforms and to ensure data consistency. The invention comprises: an information collection module, which performs static analysis on the source program and collects program information; an automatic conversion module, mainly responsible for modifying the source code at the appropriate positions according to the different variable types and inserting the appropriate runtime APIs; and a runtime module, mainly responsible for re-implementing the standard C++ memory management operations with cudaMallocManaged() and cudaFree() and providing interfaces to the outside. The invention can automatically manage the memory allocation, release, and data transmission of complex pointer data structures in OpenMP Offloading programs between CPU and GPU memory while ensuring data consistency, thereby facilitating the development of OpenMP Offloading programs.

Description

Heterogeneous platform-oriented automatic management system for complex pointer data structure
Technical Field
The invention relates to an automatic management system for complex pointer data structures in OpenMP Offloading programs, and relates to the technical field of heterogeneous programming.
Background
OpenMP, introduced by the OpenMP Architecture Review Board, is a widely accepted set of compiler directives for multiprocessor programming on shared-memory parallel systems [1]. The programming languages supported by OpenMP include C, C++, and Fortran, and compilers supporting OpenMP include the Sun, GNU, and Intel compilers, among others. OpenMP provides a high-level abstract description of parallel algorithms: programmers state their intent by adding special pragma directives to the source code, and the compiler can then automatically parallelize the program, adding synchronization, mutual exclusion, and communication where necessary.
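As a minimal illustration of this directive style (a sketch for orientation, not code from the patent), a single pragma is enough to parallelize a loop across CPU threads:

    // Minimal OpenMP CPU example: the pragma asks the compiler to split the
    // loop iterations across the available threads.
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }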
In the field of high-performance computing, various accelerators (e.g., GPUs, FPGAs, DSPs, and MLUs) have become a significant source of computing power in addition to CPUs. Starting with version 4.0, OpenMP added the Offloading feature, which supports a CPU + accelerator heterogeneous programming model; through versions 4.5 and 5.0, OpenMP Offloading was gradually refined. OpenMP Offloading makes it possible for OpenMP programs to fully exploit the computing power of heterogeneous computing platforms, but modifying an existing OpenMP CPU program into one that uses the Offloading feature remains a difficult, tedious, and error-prone task, especially when the program contains complex pointer data structures such as class nested pointers, vector containers, and multi-level nested pointers.
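A sketch of the Offloading style (illustrative, not from the patent) shows the map clauses that explicitly stage data between host and accelerator, which is exactly the burden discussed next:

    // Offloading sketch: the loop runs on the accelerator; the map() clauses
    // explicitly copy x to the device and y to and from the device.
    void saxpy_offload(int n, float a, const float *x, float *y) {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }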
Although the OpenMP Offloading syntax is simple, the user must still explicitly manage data transfer between the CPU and the accelerator with the relevant pragma directives, which greatly inconveniences developers, especially when complex pointer data structures are involved. For example, for the vector container in C++, memory allocation is implicit, so it is difficult for a developer to control memory allocation and data transmission, and hence difficult to use the Offloading feature. For nested pointers or multi-level nested pointers, handling the allocation, release, and transfer of the memory pointed to by pointers at different levels within the CPU and accelerator memory spaces is tedious and highly error-prone, which also makes developers shy away from the feature.
The CUDA programming model of the NVIDIA GPU platform has supported the Unified Memory (UM) feature since version 6.0; this feature unifies the address spaces of the CPU and GPU and automatically manages data transfer between them. The Unified Memory feature thus offers a feasible technical means for automatically handling complex pointer data structures in OpenMP Offloading program development.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to provide an automatic management system for a complex pointer data structure in an OpenMP off-floating program for a heterogeneous platform, and aims to solve the problems that in the prior art, on the basis of an OpenMP CPU program, memory allocation and release statements cannot be automatically modified and relevant pragma primitives cannot be automatically inserted, data transmission between a CPU and an accelerator cannot be automatically managed, and data consistency of the program containing the complex pointer data structure on the heterogeneous computing platform cannot be ensured, so that the performance of the program is influenced.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an automatic management system of a complex pointer data structure facing a heterogeneous platform is used for realizing the automatic management of the complex pointer data structure in an OpenMPOfflooding program on the heterogeneous computing platform and ensuring the data consistency;
the system comprises the following three modules:
an information collection module that has two functions: 1) Performing static analysis on the source program to collect program information; 2) Establishing an abstract representation of the source program, namely a serial-parallel control flow graph, based on the collected information;
the working process of this module comprises the following two steps: 1) Generating the corresponding Abstract Syntax Tree (AST) and Control Flow Graph (CFG) from the C or C++ source code with the Clang compiler, traversing the AST and the CFG, distinguishing serial and parallel domains, and acquiring detailed program information; 2) Generating the serial-parallel control flow graph from this information;
the automatic conversion module is mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph so as to complete code conversion; it first determines the type of the complex pointer data from the complex pointer variable information stored in the serial-parallel control flow graph, and then, according to the type, inserts the appropriate runtime API at the appropriate position in the source code to complete the conversion; in this way, the memory allocation, release, and data transmission operations related to complex pointer variables are all taken over by the runtime, so that complex pointer variables are managed automatically by the runtime and data consistency between the CPU and the GPU is ensured;
the runtime module is mainly responsible for implementing the following operations for the complex pointer data types on the basis of unified memory: memory allocation and release on the CPU and the GPU, and automatic data transmission between the CPU and the GPU; the runtime module consists of the UMMapper class and the allocator class, which provide memory allocation and release interfaces to the outside in the form of API interfaces.
Further, the information collection function of the information collection module is realized by the following steps:
firstly, performing lexical analysis and semantic analysis on a target program by using a Clang compiler, and generating an Abstract Syntax Tree (AST) and a Control Flow Graph (CFG);
then, the following three static analyses are performed on the AST and the CFG to obtain function call relation information, variable-related information, and serial/parallel domain-related information, providing information support for the subsequent establishment of the serial-parallel control flow graph and for code conversion;
analysis one, function call relation analysis: for an AST, the following two steps of work are performed:
1) Recursively traversing each node on the AST;
2) If the current node is a function definition node, save the function name and the information of the sub-functions it calls;
analysis two, variable information analysis: for an AST, the following three steps are performed:
1) Recursively traversing each node on the AST;
2) If the current node relates to variable definition or reference, saving variable definition information, variable reference information and variable scope information;
3) If the current node relates to memory allocation or release, storing memory allocation information and memory release information;
analysis three, serial-parallel domain analysis: for a CFG, the following three steps are performed:
1) Recursively traversing each node on the CFG, and storing relationship information between the nodes;
2) If the current node is within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node, and save its type information and range information;
3) If the current node is not within the scope of an OpenMP #pragma parallel directive, mark it as a serial node, and save its type information and range information;
through the three static analyses described above, the following information can be obtained: function call relation information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial-parallel node type information, serial-parallel node range information and inter-node relation information; this information will provide support for the serial-parallel control flow graph setup.
Further, the serial-parallel control flow graph in the information collection module is defined as follows:
defining the serial-parallel control flow graph to be composed of serial nodes, parallel nodes, and directed edges between the nodes; a serial node represents a code segment that lies outside the scope of any OpenMP #pragma parallel directive, has no internal branches, and is executed serially; the code segments corresponding to serial nodes are executed on the CPU, and serial nodes are also denoted SEQ nodes;
a parallel node represents a code segment that lies within the scope of an OpenMP #pragma parallel directive and is executed in parallel; the code segments corresponding to parallel nodes are offloaded to the GPU for execution, and parallel nodes are also denoted OMP nodes;
function calling information and variable related information are saved in the serial node and the parallel node;
and the directed edges among the nodes represent the sequential relation of the execution of the code segments corresponding to the nodes.
Further, the process of establishing a serial-parallel control flow graph in the information collection module comprises the following steps:
the establishing process of the serial-parallel control flow graph takes a function as a basic processing unit; for the whole source program, if a serial-parallel control flow graph of a function can be established, combining the collected function call relation information, the serial-parallel control flow graph of the whole source program can be recursively established;
for a function, based on the collected information, a serial-parallel control flow graph can be built by:
1) Establishing a serial node and a parallel node which are isolated one by using the node type information and the node range information;
2) Establishing directed edges among the nodes by using the relationship information among the nodes, and connecting the serial nodes and the parallel nodes into a graph;
3) And storing the function call information, the variable definition information, the variable reference information, the variable scope information, the memory allocation information and the memory release information to the corresponding serial node or parallel node according to the node range information.
Further, the complex pointer data is divided into three types according to the position of the pointer:
class nested pointers, vector containers, and multi-level nested pointers;
a class nested pointer is a pointer contained within a class, and the vector container is the vector container provided by the C++ standard library; multi-level nested pointers are pointers with two or more levels of indirection.
Further, the class nested pointer is processed by the following steps:
(1) recursively traversing the established serial-parallel control flow graph, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the class nested pointers defined in serial nodes and referenced in parallel nodes, together with their C++ classes;
(2) for each C++ class found in step (1), modifying its class definition in the source code so that the class inherits from the UMMapper base class provided by the runtime;
(3) for each class nested pointer found in step (1), modifying the memory allocation and release statements of the pointer in the source code, allocating memory with cudaMallocManaged() and releasing memory with cudaFree();
(4) for each C++ class instance found in step (1), modifying the definition statement of the instance in the source code, creating the instance with the overloaded new operator, and assigning the memory addresses allocated in step (3) to the corresponding nested pointers in the C++ class instance.
Further, the vector container is processed by the following steps:
(1) recursively traversing the established serial-parallel control flow graph, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the vector containers defined in serial nodes and referenced in parallel nodes;
(2) modifying the definition statements of the vector container instances found in step (1), and inserting an explicit call to the custom allocator provided by the runtime.
Further, the multi-level nested pointer is processed by the following steps:
(1) recursively traversing the established serial-parallel control flow graph, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the multi-level nested pointers defined in serial nodes and referenced in parallel nodes, together with their sub-pointers at each level;
(2) modifying the memory allocation and release statements of all the multi-level nested pointers and sub-pointers at each level found in step (1), allocating memory with cudaMallocManaged() and releasing memory with cudaFree().
Further, the UMMapper class in the runtime module is implemented as follows: the UMMapper class is designed to manage the memory allocation and release of C++ classes, for handling class nested pointers; the default C++ new and delete operators are overloaded in the UMMapper class using cudaMallocManaged() and cudaFree(); the allocation and release of memory space on the CPU and the GPU and the data transmission operations of classes derived from the UMMapper class can then be managed automatically by unified memory.
Further, the custom allocator class in the runtime module is implemented as follows: a custom allocator class is designed to manage the memory allocation and release of the vector container, for handling vector containers; the vector container by default manages memory allocation and release with the allocator space configurator of the C++ standard library, so a custom allocator class based on unified memory can be implemented to automatically manage the memory allocation and release of the vector container; in the custom allocator class, the allocate() function in the allocator class is implemented on the basis of cudaMallocManaged(), and the deallocate() function on the basis of cudaFree(); by explicitly invoking the custom allocator class in the vector container declaration, the allocation and release of the vector container's memory space on the CPU and the GPU and its data transmission can be managed automatically by unified memory.
The invention has the following beneficial technical effects:
the system can automatically modify the memory allocation and release statements, automatically insert related pragma primitives and automatically manage data transmission between the CPU and the accelerator on the basis of the OpenMP CPU program, so that the data consistency of the program containing a complex pointer data structure on a heterogeneous computing platform is ensured, and the program performance is improved.
The heterogeneous programming scheme studied in the invention mainly targets the automatic management of complex pointer data structures in OpenMP programs, specifically: automatically modifying memory allocation and release statements, automatically inserting the relevant pragma directives, and automatically managing data transmission between the CPU and the accelerator; this ensures the data consistency of the program on heterogeneous computing platforms and improves program performance.
Since OpenMP Offloading supports accelerator programming, OpenMP code on a CPU can be offloaded to a GPU for execution, and objective conditions permit such offloading. Offloading OpenMP code to the GPU can both improve the running efficiency of the program and make full use of the GPU's acceleration. However, manual offloading cannot guarantee the correctness of the converted program, consumes manpower and material resources, and is very inefficient. The invention therefore provides an automatic management scheme for complex pointer data structures, solving the hardest problems of the offloading process: memory allocation for complex pointer data structures and data transmission between the CPU and the accelerator.
Comparison experiments on common benchmark suites (PolyBench, Rodinia, etc.) show that the method can automatically manage complex pointer data structures in OpenMP Offloading programs, ensure program correctness, and improve program performance.
The complex pointer data structure described in the present invention refers to a complex pointer data type.
Drawings
FIG. 1 is the overall framework of the system of the present invention; FIG. 2 is a block diagram of the AST analysis method; FIG. 3 is the serial-parallel control flow graph of a function; FIG. 4 is a unified memory schematic; FIG. 5 is a schematic diagram of the automatic offloading of class nested pointers (program comparison with and without unified memory); FIG. 6 is a schematic diagram of the automatic offloading of the vector container (program comparison with and without the space configurator); FIG. 7 is a schematic diagram of the automatic offloading of multi-level nested pointers; FIG. 8 is a block diagram of the overall experimental design of the present invention;
FIG. 9 shows run-time histograms on the 2080Ti platform with the Large data volume, where FIG. 9(a) is a Large-2080Ti run-time comparison (data set a) and FIG. 9(b) is a Large-2080Ti run-time comparison (data set b); FIGS. 9(a) and 9(b) are essentially one figure, split in two because of the number of data sets;
FIG. 10 is a run-time comparison of the Rodinia test set on 2080Ti; FIG. 11 is a comparison of acceleration ratios for different data volumes of Polybench-OAO (2080Ti); FIG. 12 is a comparison of Polybench acceleration ratios (2080Ti); FIG. 13 is a runtime overhead graph (2080Ti); FIG. 14 is a detailed test chart of the complex pointer data structures (K40);
FIG. 15 is an example of program conversion with vector; FIG. 16 is an example of program conversion with structnest; FIG. 17 is an example of program conversion with multilevel. In FIGS. 15 to 17: vector is the vector container in C++; structnest is a class nested pointer; multilevel is a multi-level nested pointer.
Detailed Description
With reference to FIGS. 1 to 17, the automatic management system for complex pointer data structures oriented to heterogeneous platforms according to the present invention is described as follows:
The main task of the invention is to automatically manage the complex pointer data structures (class nested pointers, vector containers, multi-level nested pointers, etc.) in OpenMP Offloading programs, that is, to automatically modify allocation and release statements, automatically manage data transmission, and ensure data consistency. The invention mainly comprises the following three modules:
the information collection module has two functions: 1) Performing static analysis on the source program to collect program information; 2) And establishing an abstract representation, namely a serial-parallel control flow graph, for the source program based on the collected information. The working process of the module comprises the following two steps: 1) Generating a corresponding Abstract Syntax Tree (AST) and a Control Flow Graph (CFG) from a C or C + + source code through a Clang compiler, traversing the AST and the CFG, distinguishing a serial-parallel domain, and acquiring detailed program information; 2) And generating a serial-parallel control flow graph according to the information.
And the automatic conversion module is mainly responsible for inserting a runtime API into the source code based on the serial-parallel control flow graph so as to complete code conversion. Firstly, determining the type of a complex pointer variable according to complex pointer variable information stored in a serial-parallel control flow diagram. And then according to different types, inserting a proper runtime API into a proper position in the source code to finish code conversion. Therefore, the memory allocation, release and data transmission operations related to the complex pointer variables are all taken over by the runtime, so that the complex pointer variables can be automatically managed by the runtime, and the data consistency between the CPU and the GPU is ensured.
The runtime module is mainly responsible for realizing the following operations of the complex pointer data structure based on the unified memory: memory allocation and release operations on the CPU and the GPU, and automatic data transmission operation between the CPU and the GPU; wherein C + + default interfaces of new, delete, and allocator are re-implemented using cudaMallocManaged () and cudaFree (), and an interface for memory allocation and release is provided to the outside in the form of an API interface.
The result is that the source program is converted into a new program carrying runtime APIs; the overall framework of the system is shown in FIG. 1.
1 Information collection module
The information collection module implements two functions: static analysis to collect program information, and establishment of the serial-parallel control flow graph.
1.1 Information collection
Firstly, a Clang compiler is used to perform lexical analysis and semantic analysis on a target program, and an Abstract Syntax Tree (AST) and a Control Flow Graph (CFG) are generated.
Then, the following three static analyses are performed on the AST and the CFG to obtain function call relation information, variable-related information, and serial/parallel domain-related information, providing information support for the subsequent establishment of the serial-parallel control flow graph and for code conversion.
Analysis one, function call relation analysis: for an AST, the following two steps are performed:
1) Recursively traversing each node on the AST;
2) If the current node is a function definition node, save the function name and the information of the sub-functions it calls.
Analysis two, variable information analysis, for an AST, the following three steps of work are performed:
1) Recursively traversing each node on the AST;
2) If the current node relates to variable definition or reference, saving variable definition information, variable reference information and variable scope information;
3) And if the current node relates to memory allocation or release, storing memory allocation information and memory release information.
Analysis three, serial-parallel domain analysis: for a CFG, the following three steps are performed:
1) Recursively traversing each node on the CFG, and storing relationship information among the nodes;
2) If the current node is within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node, and save its type information and range information;
3) If the current node is not within the scope of an OpenMP #pragma parallel directive, mark it as a serial node, and save its type information and range information.
Through the three static analyses described above, the following information can be obtained: function call relation information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial-parallel node type information, serial-parallel node range information and inter-node relation information. This information will provide support for the serial-parallel control flow graph setup.
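A minimal sketch of how such collection can be written against Clang's RecursiveASTVisitor API is given below; the collector class and its record* helpers are illustrative assumptions, not the patent's actual implementation:

    #include <string>
    #include "clang/AST/RecursiveASTVisitor.h"

    // Sketch: walk the AST once and record the three kinds of information
    // listed above (function calls, variables, allocation sites).
    class InfoCollector : public clang::RecursiveASTVisitor<InfoCollector> {
    public:
        // Analysis 1: function definitions and the sub-functions they call.
        bool VisitFunctionDecl(clang::FunctionDecl *FD) {
            if (FD->hasBody()) recordFunction(FD->getNameAsString());
            return true;  // returning true continues the traversal
        }
        bool VisitCallExpr(clang::CallExpr *CE) {
            if (auto *Callee = CE->getDirectCallee())
                recordCall(Callee->getNameAsString());
            return true;
        }
        // Analysis 2: variable definitions, references, and allocation sites.
        bool VisitVarDecl(clang::VarDecl *VD) { recordDef(VD); return true; }
        bool VisitDeclRefExpr(clang::DeclRefExpr *DRE) { recordRef(DRE); return true; }
        bool VisitCXXNewExpr(clang::CXXNewExpr *NE) { recordAlloc(NE); return true; }

    private:
        void recordFunction(const std::string &) { /* save name and callees */ }
        void recordCall(const std::string &) { /* attach to current function */ }
        void recordDef(clang::VarDecl *) { /* save definition and scope */ }
        void recordRef(clang::DeclRefExpr *) { /* save reference site */ }
        void recordAlloc(clang::CXXNewExpr *) { /* save allocation/release site */ }
    };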
1.2 Serial-parallel control flow graph establishment
The serial-parallel control flow graph is defined to consist of serial nodes, parallel nodes, and directed edges between the nodes. A serial node represents a code segment that lies outside the scope of any OpenMP #pragma parallel directive, has no internal branches, and is executed serially; the code segments corresponding to serial nodes are executed on the CPU, and serial nodes are also denoted SEQ nodes.
A parallel node represents a code segment that lies within the scope of an OpenMP #pragma parallel directive and is executed in parallel; the code segments corresponding to parallel nodes are offloaded to the GPU for execution, and parallel nodes are also denoted OMP nodes. Both serial and parallel nodes store function call information and variable-related information. The directed edges between nodes represent the order in which the corresponding code segments execute.
The establishing process of the serial-parallel control flow graph takes a function as a basic processing unit; for the whole source program, if a serial-parallel control flow graph of a function can be established, the serial-parallel control flow graph of the whole source program can be recursively established by combining function call relation information.
For a function, based on collected program information, a serial-parallel control flow graph may be built by:
1) Establishing a serial node and a parallel node which are isolated one by using the node type information and the node range information;
2) Establishing directed edges among the nodes by using the relationship information among the nodes, and connecting the serial nodes and the parallel nodes into a graph;
3) And storing the function call information, the variable definition information, the variable reference information, the variable scope information, the memory allocation information and the memory release information to the corresponding serial node or parallel node according to the node range information.
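A plausible in-memory form of such a graph is sketched below; the patent does not give its data layout, so all names here are assumptions:

    #include <string>
    #include <vector>

    // Serial-parallel control flow graph node: SEQ nodes run on the CPU,
    // OMP nodes are offloaded to the GPU.
    enum class NodeKind { SEQ, OMP };

    struct VarInfo {
        std::string name;
        bool isDefinition;   // defined here (vs. only referenced)
        bool isAllocSite;    // memory allocation or release site
    };

    struct SPNode {
        NodeKind kind;                    // node type information
        int beginLine, endLine;           // node range information
        std::vector<std::string> calls;   // function call information
        std::vector<VarInfo> vars;        // variable-related information
        std::vector<SPNode *> succs;      // directed edges: execution order
    };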
2 Automatic conversion module
The automatic conversion module divides the complex pointer data types into three categories: class nested pointers, vector containers, and multi-level nested pointers.
The key technology in complex pointer data structure processing is Unified Memory (UM), so the concept and principle of unified memory are introduced first. UM maintains a unified memory pool shared by the CPU and GPU: memory is allocated only once, and the resulting data pointer is available to both the host side (CPU) and the device side (GPU). Because a single pointer addresses the managed memory, data transfers between different devices are performed automatically by the UM runtime system, and the GPU is even allowed to process data sets that exceed its memory capacity. The unified memory principle is shown in FIG. 4.
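The following minimal CUDA C++ sketch (illustrative, not taken from the patent) shows the single-pointer property: the pointer returned by cudaMallocManaged() is dereferenced on both the host and the device, and the UM runtime moves the data:

    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;                  // device writes the same pointer
    }

    int main() {
        const int n = 1 << 20;
        float *x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float)); // one allocation, one pointer
        for (int i = 0; i < n; ++i) x[i] = 1.0f;  // host writes, no explicit copy
        scale<<<(n + 255) / 256, 256>>>(x, n);
        cudaDeviceSynchronize();                  // wait before the host reads again
        cudaFree(x);
        return 0;
    }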
Using the unified memory technology, different processing methods are designed for the three complex pointer data types.
2.1 Class nested pointers
A class nested pointer is a pointer contained within a class; it is processed by the following steps, as shown in FIG. 5:
(1) recursively traversing the serial-parallel control flow graph, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the class nested pointers defined in serial nodes and referenced in parallel nodes, together with their classes;
(2) for each class found in step (1), modifying its class definition in the source code so that the class inherits from the UMMapper base class provided by the runtime (see the runtime design);
(3) for each class nested pointer found in step (1), modifying the memory allocation and release statements of the pointer in the source code, allocating memory with cudaMallocManaged() and releasing memory with cudaFree();
(4) for each class instance found in step (1), modifying the definition statement of the instance in the source code, creating the instance with the overloaded new operator, and assigning the memory addresses allocated in step (3) to the corresponding nested pointers in the class.
An example of the processing of class nested pointers is shown in FIG. 5.
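In code form, the conversion has roughly the following shape (a sketch: class A and member data are illustrative names, and UMMapper is the runtime base class sketched in section 3):

    #include <cuda_runtime.h>
    // Before conversion (host-only):
    //   class A { public: float *data; };
    //   A *a = new A();  a->data = (float *)malloc(n * sizeof(float));

    class A : public UMMapper {  // step (2): inherit the runtime's UMMapper base
    public:                      // (UMMapper is sketched in section 3 below)
        float *data;
    };

    void build(int n) {
        A *a = new A();                              // step (4): overloaded new -> unified memory
        float *buf = nullptr;
        cudaMallocManaged(&buf, n * sizeof(float));  // step (3): nested buffer in unified memory
        a->data = buf;                               // step (4): pass the address to the nested pointer
        // ... the offloaded region can now read and write a->data ...
        cudaFree(a->data);                           // step (3): release with cudaFree
        delete a;                                    // overloaded delete -> cudaFree
    }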
2.2 Vector container
The vector container is the vector container provided by the C++ standard library. It is processed by the following steps:
(1) recursively traversing the serial-parallel control flow graph, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the vector containers defined in serial nodes and referenced in parallel nodes;
(2) modifying the definition statements of the vector container instances found in step (1), and inserting an explicit call to the custom allocator provided by the runtime (see the runtime design).
A vector container processing example is shown in FIG. 6.
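In code form, the conversion is a one-line change to the declaration (um_allocator is the assumed name of the runtime's custom allocator, sketched in section 3):

    #include <vector>

    // Before conversion:
    //   std::vector<float> v(n);
    // After conversion: the explicit allocator argument places the element
    // storage in unified memory (um_allocator is sketched in section 3).
    std::vector<float, um_allocator<float>> make_vec(int n) {
        return std::vector<float, um_allocator<float>>(n);
    }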
2.3 Multi-level nested pointers
Multi-level nested pointers are pointers with two or more levels of indirection. They are processed by the following steps:
(1) recursively traversing the serial-parallel control flow graph, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the multi-level nested pointers defined in serial nodes and referenced in parallel nodes, together with their sub-pointers at each level;
(2) modifying the memory allocation and release statements of all the multi-level nested pointers and sub-pointers at each level found in step (1), allocating memory with cudaMallocManaged() and releasing memory with cudaFree().
An example of the processing of a multi-level nested pointer is shown in FIG. 7.
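In code form, the converted allocation pattern for a two-level pointer looks roughly as follows (variable names are illustrative):

    #include <cuda_runtime.h>

    // Both the level-1 pointer array and every level-2 row buffer are
    // allocated with cudaMallocManaged(), so one pointer works on CPU and GPU.
    float **make_matrix(int rows, int cols) {
        float **x = nullptr;
        cudaMallocManaged(&x, rows * sizeof(float *));       // level-1 pointer array
        for (int i = 0; i < rows; ++i)
            cudaMallocManaged(&x[i], cols * sizeof(float));  // level-2 row buffers
        return x;
    }

    void free_matrix(float **x, int rows) {
        for (int i = 0; i < rows; ++i) cudaFree(x[i]);
        cudaFree(x);
    }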
3 Runtime
In order to keep the converted code close to the format of the source code and to minimize modifications to it, this work proposes inserting a small number of runtime APIs into the original program and encapsulating the unified-memory-based memory allocation in the runtime.
For classes and their nested pointers, the UMMapper base class responsible for memory allocation and release is implemented in the runtime, as shown in FIG. 5. The default C++ new and delete operators are overloaded in UMMapper using cudaMallocManaged() and cudaFree(). A class therefore only needs to derive from UMMapper at declaration time and build its instances with new, and the memory allocation, release, and data transmission operations of the derived class are managed automatically through unified memory.
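A minimal sketch of such a base class is given below; the patent does not print its source, so the exact overload set is an assumption:

    #include <cuda_runtime.h>
    #include <cstddef>

    // UMMapper sketch: any class derived from it gets its instances placed
    // in unified memory by the overloaded new/delete.
    class UMMapper {
    public:
        void *operator new(std::size_t sz) {
            void *p = nullptr;
            cudaMallocManaged(&p, sz);  // object storage in unified memory
            return p;
        }
        void *operator new[](std::size_t sz) {
            void *p = nullptr;
            cudaMallocManaged(&p, sz);
            return p;
        }
        void operator delete(void *p) { cudaFree(p); }
        void operator delete[](void *p) { cudaFree(p); }
    };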
The vector container manages memory allocation and release with the allocator space configurator of the C++ standard library, so a custom allocator class can be implemented on top of unified memory to automatically manage the memory allocation and release of vector containers. In the custom allocator class, the allocate() function is implemented on the basis of cudaMallocManaged(), and the deallocate() function on the basis of cudaFree(). By explicitly invoking the custom allocator class in the vector container declaration, the allocation and release of the vector container's memory on the CPU and the GPU and its data transmission can be managed automatically by unified memory, as shown in FIG. 6.
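A minimal sketch of such an allocator follows; the name um_allocator and the minimal C++11 allocator interface are assumptions, since the patent only names allocate() and deallocate():

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <vector>

    // Unified-memory allocator for std::vector: element storage is valid on
    // both the CPU and the GPU.
    template <typename T>
    struct um_allocator {
        using value_type = T;
        um_allocator() = default;
        template <typename U> um_allocator(const um_allocator<U> &) {}

        T *allocate(std::size_t n) {
            T *p = nullptr;
            cudaMallocManaged(&p, n * sizeof(T));  // backing store in unified memory
            return p;
        }
        void deallocate(T *p, std::size_t) { cudaFree(p); }
    };

    template <typename T, typename U>
    bool operator==(const um_allocator<T> &, const um_allocator<U> &) { return true; }
    template <typename T, typename U>
    bool operator!=(const um_allocator<T> &, const um_allocator<U> &) { return false; }

    // Usage: std::vector<float, um_allocator<float>> v(n);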
The technical effects of the invention are verified as follows:
to verify and analyze the effect of the code off-loading scheme, we performed tests on the RTX 2080Ti platform, using a benchmarking dataset such as PolyBench, rodinia, etc., comparing the test results with another well-known source-to-source compiler (DawnCC) and Manual translation (Manual) results and CPU-parallel code. Our scheme is named OAO (OpenMP Automated Offlowingwith complex data structure Resupport).
Obtaining the experimental results involves four stages: integration, script-driven runs, data extraction with Python, and analysis with OriginLab. The whole experimental process is shown in FIG. 8.
1 Run time
The run times for the Large data volume on the 2080Ti platform are shown in FIG. 9.
The run-time comparison of the Rodinia test set on 2080Ti is shown in FIG. 10.
From the above collection and comparison of run times across the data sets, the following conclusions can be drawn:
(1) As the run-time figures show, OAO can handle all 23 test programs, while DawnCC can handle only 15. OAO improves performance on 9 programs on the K40 platform and on 15 programs on the 2080Ti platform; and relative to manual conversion, OAO performs best on all test programs on all platforms.
(2) For Rodinia on 2080Ti, the data volumes were determined from preliminary tests; of the 8 data sets in total, OAO runs faster than OMP on 5.
(3) On Polybench-Large, the running time of OAO is better than that of DawnCC, Manual, and native; overall, the run times follow the trend OAO < DawnCC < Manual < native. The native version runs longest because of the transmission statements it selects and its lack of parallel optimization; Manual contains redundant transfers of some data, so it runs longer than OAO; and OAO is also generally superior to DawnCC in the handling of the data sets.
2 Acceleration ratio
As can be seen from FIG. 11, for acceleration ratios greater than 1, the ratio at the Large data volume exceeds that at Medium, which matches the character of parallel acceleration: the larger the data volume processed, the more pronounced the parallel acceleration effect.
The acceleration ratio is computed as OMP/Type, where Type is each of the other versions; with 1X as the boundary, values above it indicate positive acceleration and values below it indicate negative acceleration. As FIG. 12 shows, across all test programs the acceleration ratio of OAO is better than Manual on every platform, and on the Polybench test programs OAO is slightly better than DawnCC. Since DawnCC cannot process the Rodinia test programs, OAO has the best data transmission optimization effect on all test programs, can process complex test programs that DawnCC cannot, and has a wider range of application.
3 Runtime overhead
In the Polybench test set, the overhead at runtime is shown in fig. 13.
The runtime overhead of every test program is more than two orders of magnitude smaller than its total running time, so the overhead of the OAO runtime is very small, almost negligible: adding the runtime costs the source program no performance.
4 Unified memory
The detailed test of unified memory covers the vector container, the class nested pointer (structnest), and the multi-level nested pointer (multilevel). The run times and data transfer times of the complex pointer data structures are compared in FIG. 14.
In terms of data transfer time, the H2D and D2H times are far smaller than the total run time, and all complex pointer data structures are processed correctly, so the OAO compiler correctly supports GPU offloading of complex pointer data structures.
Simple converted-program comparisons for vector, structnest, and multilevel are shown in FIGS. 15, 16, and 17, respectively.
In FIG. 15, the declaration of the vector variable is rewritten to the QY::allocator form; in FIG. 16, the nested pointers x, y, and z inside the class are re-allocated with cudaMallocManaged() at declaration time, and the instantiated class a is marked in the transfer statement by adding map(from: a[0]); in FIG. 17, x is a two-level pointer and y is the one-level sub-pointer of x, and both are re-allocated with cudaMallocManaged() at declaration time.
Table 1 unified memory test results table
The unified memory test results are shown in Table 1. During data transmission, one preliminary transfer is made before the formal transfers, with an H2D size of 4 B and a D2H size of 1 B (the 4+1 pattern); this is why, in the table, the bidirectional transfer count of vector is 52 rather than 49 and that of structnest is 36 rather than 33 (3 more in each case). Beyond that, vector makes only 2 transfers, for X and Y, of 24 B each, while structnest makes 1 transfer, for the instance A, of size 32 B.
In multilevel, z is transferred one way using the data-transmission model, while x and y are multi-level nested pointers managed by unified memory. Apart from the preliminary 4+1 transfers, the H2D and D2H transfer volumes differ because z is not copied back while x is.
The experimental results demonstrate that OAO can correctly process complex pointer data structures using unified memory, realizing support for complex pointer data structure processing.

Claims (10)

1. An automatic management system for a complex pointer data structure oriented to heterogeneous platforms, characterized in that the system is used to automatically manage complex pointer data structures in OpenMP Offloading programs on heterogeneous computing platforms and to ensure data consistency;
the system comprises the following three modules:
the information collection module has two functions: 1) Performing static analysis on the source program to collect program information; 2) Establishing an abstract representation of the source program, namely a serial-parallel control flow graph, based on the collected information;
the working process of the module comprises the following two steps: 1) Generating a corresponding abstract syntax tree AST and a control flow graph CFG from a C or C + + source code through a Clang compiler, traversing the AST and the CFG, distinguishing a serial-parallel domain, and acquiring detailed program information; 2) Generating a serial-parallel control flow graph according to the information;
the automatic conversion module is mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph so as to complete code conversion; it first determines the type of the complex pointer data according to the complex pointer variable information stored in the serial-parallel control flow graph, and then, according to the type, inserts the appropriate runtime APIs at the appropriate positions in the source code to complete the code conversion; in this way, the memory allocation, release, and data transmission operations related to complex pointer variables are all taken over by the runtime, so that complex pointer variables are managed automatically by the runtime and data consistency between the CPU and the GPU is ensured;
the runtime module is mainly responsible for implementing the following operations for the complex pointer data types on the basis of unified memory: memory allocation and release on the CPU and the GPU, and automatic data transmission between the CPU and the GPU; the runtime module consists of the UMMapper class and the allocator class, which provide memory allocation and release interfaces to the outside in the form of API interfaces.
2. The system for automatically managing the complex pointer data structure oriented to the heterogeneous platform according to claim 1, wherein the information collection function of the information collection module is implemented by:
firstly, performing lexical analysis and semantic analysis on a target program by using a Clang compiler, and generating an abstract syntax tree AST and a control flow graph CFG;
then, the AST and the CFG are subjected to the following three static analyses to obtain function call relation information, variable related information and serial/parallel domain related information, and information support is provided for the establishment of a serial-parallel control flow graph and code conversion;
analysis one, function call relation analysis: for an AST, the following two steps are performed:
1) Recursively traversing each node on the AST;
2) If the current node is a function definition node, storing the function name and the sub-function information called by the function;
analysis two, variable information analysis: for an AST, the following three steps are performed:
1) Recursively traversing each node on the AST;
2) If the current node relates to variable definition or reference, saving variable definition information, variable reference information and variable scope information;
3) If the current node relates to memory allocation or release, storing memory allocation information and memory release information;
analysis three, serial-parallel domain analysis: for a CFG, the following three steps are performed:
1) Recursively traversing each node on the CFG, and storing relationship information between the nodes;
2) If the current node is within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node, and save its type information and range information;
3) If the current node is not within the scope of an OpenMP #pragma parallel directive, mark it as a serial node, and save its type information and range information;
through the three static analyses described above, the following information can be obtained: function call relation information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial-parallel node type information, serial-parallel node range information and inter-node relation information; this information will provide support for the serial-parallel control flow graph setup.
3. The automatic management system for the complex pointer data structure oriented to the heterogeneous platform according to claim 1 or 2, characterized in that the serial-parallel control flow graph in the information collection module is defined as follows:
defining the serial-parallel control flow graph to be composed of serial nodes, parallel nodes, and directed edges between the nodes; a serial node represents a code segment that lies outside the scope of any OpenMP #pragma parallel directive, has no internal branches, and is executed serially; the code segments corresponding to serial nodes are executed on the CPU, and serial nodes are also denoted SEQ nodes;
a parallel node represents a code segment that lies within the scope of an OpenMP #pragma parallel directive and is executed in parallel; the code segments corresponding to parallel nodes are offloaded to the GPU for execution, and parallel nodes are also denoted OMP nodes;
function calling information and variable related information are saved in the serial node and the parallel node;
and the directed edges among the nodes represent the sequential relation of the execution of the code segments corresponding to the nodes.
4. The system according to claim 3, wherein the process of establishing the serial-parallel control flow graph in the information collection module comprises:
the establishing process of the serial-parallel control flow graph takes a function as a basic processing unit; for the whole source program, if a serial-parallel control flow graph of a function can be established, combining the function call relation information collected in claim 2, the serial-parallel control flow graph of the whole source program can be recursively established;
for a function, based on the information gathered in claim 2, a serial-parallel control flow graph can be built by:
1) Establishing a serial node and a parallel node which are isolated one by using the node type information and the node range information;
2) Establishing directed edges among the nodes by using the relationship information among the nodes, and connecting the serial nodes and the parallel nodes into a graph;
3) And storing the function call information, the variable definition information, the variable reference information, the variable scope information, the memory allocation information and the memory release information to the corresponding serial node or parallel node according to the node range information.
5. The automatic management system for the complex pointer data structure oriented to heterogeneous platforms according to claim 4, wherein the complex pointer data is classified into three types according to the position of the pointer:
class nested pointers, vector containers, and multi-level nested pointers;
a class nested pointer is a pointer contained within a class, and the vector container is the vector container provided by the C++ standard library; multi-level nested pointers are pointers with two or more levels of indirection.
6. The heterogeneous-platform-oriented automatic management system for the complex pointer data structure of claim 5, characterized in that the class nested pointer is processed by the following steps:
(1) recursively traversing the serial-parallel control flow graph established in claim 4, and finding, according to the variable definition and reference information stored in the serial/parallel nodes, the class nested pointers defined in serial nodes and referenced in parallel nodes, together with their C++ classes;
(2) for each C++ class found in step (1), modifying its class definition in the source code so that the class inherits from the UMMapper base class provided by the runtime;
(3) for each class nested pointer found in step (1), modifying the memory allocation and release statements of the pointer in the source code, allocating memory with cudaMallocManaged() and releasing memory with cudaFree();
(4) for each C++ class instance found in step (1), modifying the definition statement of the instance in the source code, creating the instance with the overloaded new operator, and assigning the memory addresses allocated in step (3) to the corresponding nested pointers in the C++ class instance.
7. The system according to claim 5, wherein the vector container is processed by the following steps:
(1) recursively traversing the serial-parallel control flow graph established in claim 4, and finding a vector container defined in the serial node and referred to in the parallel node according to the variable definition and reference information stored in the serial/parallel node;
(2) modifying the definition statements of the vector container instances found in step (1), and inserting an explicit call to the custom allocator provided by the runtime.
8. The heterogeneous platform-oriented complex pointer data structure automatic management system according to claim 5, wherein the multi-level nested pointers are processed by the following steps:
(1) recursively traversing the serial-parallel control flow graph established in claim 4, and finding out multi-level nested pointers defined in the serial nodes and referred to in the parallel nodes and sub-pointers of the levels thereof according to variable definitions and reference information stored in the serial/parallel nodes;
(2) modifying the memory allocation and release statements of all the multi-level nested pointers and the sub-pointers at each level found in step (1), allocating memory with cudaMallocManaged() and releasing memory with cudaFree().
9. The system for automatically managing the complex pointer data structure oriented to heterogeneous platforms as claimed in claim 6, wherein the UMMapper class in the runtime module is implemented as follows:
designing the UMMapper class to manage the memory allocation and release of C++ classes, for handling class nested pointers; the default C++ new and delete operators are overloaded in the UMMapper class using cudaMallocManaged() and cudaFree(); the allocation and release of memory space on the CPU and the GPU and the data transmission operations of classes derived from the UMMapper class can then be managed automatically by unified memory.
10. The system according to claim 7, wherein the implementation process of the custom allocator class in the runtime module is as follows:
designing a custom allocator class to manage the memory allocation and release of the vector container, for handling vector containers; the vector container by default manages memory allocation and release with the allocator space configurator of the C++ standard library, so a custom allocator class based on unified memory can be implemented to automatically manage the memory allocation and release of the vector container; in the custom allocator class, the allocate() function in the allocator class is implemented on the basis of cudaMallocManaged(), and the deallocate() function on the basis of cudaFree(); by explicitly invoking the custom allocator class in the vector container declaration, the allocation and release of the vector container's memory space on the CPU and the GPU and its data transmission can be managed automatically by unified memory.
CN202010971038.9A 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure Active CN112083956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010971038.9A CN112083956B (en) 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010971038.9A CN112083956B (en) 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure

Publications (2)

Publication Number Publication Date
CN112083956A (en) 2020-12-15
CN112083956B (en) 2022-12-09

Family

ID=73736379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010971038.9A Active CN112083956B (en) 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure

Country Status (1)

Country Link
CN (1) CN112083956B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116956236A (en) * 2022-04-13 2023-10-27 堡垒科技有限公司 Decentralizing open source software trusted communication certification protocol

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110383247A (en) * 2017-04-28 2019-10-25 伊纽迈茨有限公司 Method, computer-readable medium and heterogeneous computing system performed by computer
CN111611158A (en) * 2020-05-08 2020-09-01 中国原子能科学研究院 Application performance analysis system and method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001147819A (en) * 1999-11-19 2001-05-29 Fujitsu Ltd Optimizing device and recording medium
CN101963918B (en) * 2010-10-26 2013-05-01 上海交通大学 Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform
CN102707952A (en) * 2012-05-16 2012-10-03 上海大学 User description based programming design method on embedded heterogeneous multi-core processor
CN104035781B (en) * 2014-06-27 2017-06-23 北京航空航天大学 A kind of method of quick exploitation isomerism parallel program
CN106874113A (en) * 2017-01-19 2017-06-20 国电南瑞科技股份有限公司 A kind of many GPU heterogeneous schemas static security analysis computational methods of CPU+
CN106940654B (en) * 2017-02-15 2020-08-14 南京航空航天大学 Automatic detection and positioning method for memory error in source code
CN113961446A (en) * 2018-03-08 2022-01-21 华东师范大学 Verification system applied to runtime formal verification method for source code
CN109933327B (en) * 2019-02-02 2021-01-08 中国科学院计算技术研究所 OpenCL compiler design method and system based on code fusion compiling framework

Also Published As

Publication number Publication date
CN112083956A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
US9471291B2 (en) Multi-processor code for modification for storage areas
US7912877B2 (en) Leveraging garbage collection to dynamically infer heap invariants
EP1519273B1 (en) Region-based memory management for object-oriented programs
US6832369B1 (en) Object oriented method and apparatus for class variable initialization
JPH06103075A (en) Operation for object-oriented application
US20170115975A1 (en) Universal adapter for native calling
Bauer et al. Structure slicing: Extending logical regions with fields
CN112083956B (en) Heterogeneous platform-oriented automatic management system for complex pointer data structure
Moses et al. High-performance gpu-to-cpu transpilation and optimization via high-level parallel constructs
US20020062478A1 (en) Compiler for compiling source programs in an object-oriented programming language
Martínez et al. A highly optimized skeleton for unbalanced and deep divide-and-conquer algorithms on multi-core clusters
Calvert Parallelisation of java for graphics processors
CN111966397A (en) Automatic transplanting and optimizing method for heterogeneous parallel programs
EP3458956A1 (en) Peer-to-peer distributed computing system for heterogeneous device types
Courtès C language extensions for hybrid CPU/GPU programming with StarPU
US20170083298A1 (en) Resilient format for distribution of ahead-of-time compiled code components
US11573777B2 (en) Method and apparatus for enabling autonomous acceleration of dataflow AI applications
Chambers et al. Iterative type analysis and extended message splitting: Optimizing dynamically-typed object-oriented programs
Wroblewski et al. Accelerating Spark Datasets by inlining deserialization
Taft Multicore programming in ParaSail: parallel specification and implementation language
Katsaragakis et al. A memory footprint optimization framework for Python applications targeting edge devices
Lang Improved stack allocation using escape analysis in the KESO multi-JVM
Abbas et al. Object oriented parallel programming
Wolf et al. Object support for OpenMP-style programming of GPU clusters in Java
Li et al. Two-Level task scheduling for irregular applications on GPU platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant