CN112083956A - Heterogeneous platform-oriented automatic management system for complex pointer data structure - Google Patents

Heterogeneous platform-oriented automatic management system for complex pointer data structure

Info

Publication number
CN112083956A
CN112083956A (application CN202010971038.9A)
Authority
CN
China
Prior art keywords
information
serial
parallel
node
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010971038.9A
Other languages
Chinese (zh)
Other versions
CN112083956B (en)
Inventor
张伟哲
何慧
王法瑞
方滨兴
郝萌
郭浩男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202010971038.9A priority Critical patent/CN112083956B/en
Publication of CN112083956A publication Critical patent/CN112083956A/en
Application granted granted Critical
Publication of CN112083956B publication Critical patent/CN112083956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/425 Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Stored Programmes (AREA)

Abstract

An automatic management system for complex pointer data structures oriented to heterogeneous platforms, relating to the technical field of heterogeneous programming. The invention aims to automatically manage complex pointer data structures in OpenMP Offloading programs on heterogeneous computing platforms and to ensure data consistency. The invention comprises: an information collection module, which statically analyzes the source program and collects program information; an automatic conversion module, which is mainly responsible for modifying the source code at the appropriate positions according to the different variable types and inserting the appropriate runtime APIs; and a runtime module, which is mainly responsible for re-implementing the C++ standard memory management operations with cudaMallocManaged() and cudaFree() and providing external interfaces. The invention can automatically manage the memory allocation, release, and data transmission of complex pointer data structures in OpenMP Offloading programs between CPU and GPU memory while ensuring data consistency, thereby facilitating the development of OpenMP Offloading programs.

Description

Heterogeneous platform-oriented automatic management system for complex pointer data structure
Technical Field
The invention relates to an automatic management system for complex pointer data structures in OpenMP Offloading programs, in the technical field of heterogeneous programming.
Background
OpenMP, introduced by the OpenMP Architecture Review Board, is a widely accepted set of compiler directives for multiprocessor programming on shared-memory parallel systems [1]. The programming languages supported by OpenMP include C, C++, and Fortran, and OpenMP-enabled compilers include the Sun, GNU, and Intel compilers, among others. OpenMP provides a high-level abstract description of parallel algorithms: programmers specify their intent by adding special pragma directives to the source code, so that the compiler can automatically parallelize the program and insert synchronization, mutual exclusion, and communication where necessary.
In the field of high-performance computing, various accelerators (e.g., GPU, FPGA, DSP, MLU) have become a significant source of computing power in addition to the CPU. Starting from version 4.0, OpenMP added Offloading, which supports a CPU + accelerator heterogeneous programming model; through versions 4.5 and 5.0, OpenMP Offloading has gradually matured. OpenMP Offloading makes it possible for OpenMP programs to fully utilize the computing power of heterogeneous platforms, but converting existing OpenMP CPU programs into programs that use the Offloading features is still a difficult, tedious, and error-prone task, especially when the programs contain complex pointer data structures such as class-nested pointers, vector containers, and multi-level nested pointers.
Although the OpenMP Offloading syntax is simple, it still requires the user to explicitly manage data transfer between the CPU and the accelerator with the associated pragma directives, which is a great inconvenience for developers, especially with complex pointer data structures. For example, for the vector container in C++, memory allocation is implicit, so it is difficult for a developer to control memory allocation and data transmission, and hence difficult to use the Offloading features. For nested pointers or multi-level nested pointers, handling the allocation, release, and data transmission of the memory pointed to by each pointer level in both the CPU and accelerator memory spaces is tedious and highly error-prone, which again puts developers at a disadvantage when using Offloading.
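For illustration only (this fragment is not part of the patent), a minimal sketch of the explicit mapping a developer must otherwise write, here for a flat array; a nested or multi-level pointer would need one such map clause and an address fix-up per pointer level:

    // Explicit OpenMP Offloading: the developer spells out the data movement by hand.
    void scale(double *data, int n) {
        // map(tofrom: data[0:n]) copies the buffer to the accelerator and back.
        #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
        for (int i = 0; i < n; ++i)
            data[i] *= 2.0;
    }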
The CUDA programming model for NVIDIA GPU platforms has supported the Unified Memory (UM) feature since version 6.0. This feature unifies the address spaces of the CPU and GPU and automatically manages data transmission between them. Unified memory thus provides a feasible technical means for automatically handling complex pointer data structures in the development of OpenMP Offloading programs.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to provide an automatic management system for a complex pointer data structure in an OpenMP offload program oriented to a heterogeneous platform, and the automatic management system is used for solving the problems that in the prior art, on the basis of an OpenMP CPU program, memory allocation and release statements cannot be automatically modified and relevant pragma primitives cannot be automatically inserted, data transmission between a CPU and an accelerator cannot be automatically managed, and data consistency of the program containing the complex pointer data structure on the heterogeneous computing platform cannot be ensured, so that the performance of the program is influenced.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an automatic management system of a complex pointer data structure facing a heterogeneous platform is used for realizing the automatic management of the complex pointer data structure in an OpenMP off-loading program on the heterogeneous computing platform and ensuring the data consistency;
the system comprises the following three modules:
the information collection module has two functions: 1) statically analyzing the source program to collect program information; 2) building an abstract representation of the source program, namely a serial-parallel control flow graph, based on the collected information;
the module works in two steps: 1) generating the corresponding Abstract Syntax Tree (AST) and Control Flow Graph (CFG) from the C or C++ source code with the Clang compiler, traversing the AST and CFG, distinguishing serial and parallel domains, and collecting detailed program information; 2) generating the serial-parallel control flow graph from this information;
the automatic conversion module is mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph to complete the code conversion; it first determines the complex pointer data type from the complex pointer variable information stored in the serial-parallel control flow graph, and then, according to the type, inserts the appropriate runtime API at the appropriate position in the source code; in this way, the memory allocation, release, and data transmission operations involving complex pointer variables are all taken over by the runtime, so that complex pointer variables are managed automatically and data consistency between the CPU and the GPU is ensured;
the runtime module is mainly responsible for implementing, based on unified memory, the following operations for complex pointer data types: memory allocation and release on the CPU and GPU, and automatic data transmission between the CPU and GPU; the runtime module consists of a UMMapper class and an allocator class, which expose memory allocation and release interfaces externally in the form of APIs.
Further, the information collection function of the information collection module is realized by the following steps:
firstly, the Clang compiler is used to perform lexical and semantic analysis on the target program and to generate the Abstract Syntax Tree (AST) and Control Flow Graph (CFG);
then, the following three static analyses are performed on the AST and CFG to obtain function call relation information, variable-related information, and serial/parallel domain information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion;
analysis one, function call relation analysis: for an AST, the following two steps are performed:
1) recursively traversing each node on the AST;
2) if the current node is a function definition node, saving the function name and information about the sub-functions it calls;
analysis two, variable information analysis: for an AST, the following three steps are performed:
1) recursively traversing each node on the AST;
2) if the current node involves a variable definition or reference, saving the variable definition, variable reference, and variable scope information;
3) if the current node involves memory allocation or release, saving the memory allocation and memory release information;
analysis three, serial-parallel domain analysis: for a CFG, the following three steps are performed:
1) recursively traversing each node on the CFG and saving the inter-node relationship information;
2) if the current node is within the scope of an OpenMP #pragma parallel directive, marking it as a parallel node and storing its type information and range information;
3) if the current node is not within the scope of an OpenMP #pragma parallel directive, marking it as a serial node and storing its type information and range information;
through the three static analyses above, the following information is obtained: function call relation information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial-parallel node type information, serial-parallel node range information, and inter-node relationship information; this information supports the construction of the serial-parallel control flow graph.
Further, the serial-parallel control flow graph in the information collection module is defined as follows:
the serial-parallel control flow graph is defined as consisting of serial nodes, parallel nodes, and directed edges between nodes; a serial node represents a code segment that is outside the scope of any OpenMP #pragma parallel directive, contains no branches, and executes serially; the code segment corresponding to a serial node executes on the CPU, and serial nodes are also denoted SEQ nodes;
a parallel node represents a code segment that is within the scope of an OpenMP #pragma parallel directive and executes in parallel; the code segment corresponding to a parallel node is offloaded to the GPU for execution, and parallel nodes are also denoted OMP nodes;
function call information and variable-related information are stored in both serial and parallel nodes;
the directed edges between nodes represent the execution order of the corresponding code segments.
Further, the construction of the serial-parallel control flow graph in the information collection module proceeds as follows:
the serial-parallel control flow graph is built with the function as the basic processing unit; for the whole source program, once the serial-parallel control flow graph of each function has been built, the graph of the whole program can be built recursively by combining the collected function call relation information;
for a function, based on the collected information, a serial-parallel control flow graph can be built by the following steps:
1) creating the isolated serial and parallel nodes one by one, using the node type information and node range information;
2) creating directed edges between the nodes, using the inter-node relationship information, to connect the serial and parallel nodes into a graph;
3) storing the function call information, variable definition information, variable reference information, variable scope information, memory allocation information, and memory release information into the corresponding serial or parallel node according to the node range information.
Further, the complex pointer data is divided into three types according to the position of the pointer:
class-nested pointers, vector containers, and multi-level nested pointers;
a class-nested pointer is a pointer contained in a class; a vector container is the vector container provided by the C++ standard library; multi-level nested pointers are pointers with two or more levels of indirection.
Further, the class-nested pointer is processed by the following steps:
1) recursively traversing the established serial-parallel control flow graph and, from the variable definition and reference information stored in the serial/parallel nodes, finding the class-nested pointers that are defined in serial nodes and referenced in parallel nodes, together with the C++ classes that contain them;
2) for each C++ class found in step 1), modifying the class definition in the source code so that the class inherits from the UMMapper base class provided by the runtime;
3) for each class-nested pointer found in step 1), modifying its memory allocation and release statements in the source code, allocating memory with cudaMallocManaged() and releasing it with cudaFree();
4) for each instance of the C++ classes found in step 1), modifying its definition statement in the source code, creating the instance with the overloaded new operator, and passing the memory addresses allocated in step 3) to the corresponding nested pointers inside the instance.
Further, the vector container is processed by the following steps:
1) recursively traversing the established serial-parallel control flow graph and, from the variable definition and reference information stored in the serial/parallel nodes, finding the vector containers that are defined in serial nodes and referenced in parallel nodes;
2) modifying the definition statements of the vector container instances found in step 1), inserting an explicit call to the custom allocator provided by the runtime.
Further, the multi-level nested pointer is processed by the following steps:
1) recursively traversing the established serial-parallel control flow graph and, from the variable definition and reference information stored in the serial/parallel nodes, finding the multi-level nested pointers that are defined in serial nodes and referenced in parallel nodes, together with their sub-pointers at every level;
2) modifying the memory allocation and release statements of all multi-level nested pointers and sub-pointers found in step 1), allocating memory with cudaMallocManaged() and releasing it with cudaFree().
Further, the UMMapper class in the runtime module is implemented as follows: a UMMapper class is designed to manage the memory allocation and release of the C++ classes, in order to handle class-nested pointers; inside the UMMapper class, the default C++ new and delete operators are overloaded with cudaMallocManaged() and cudaFree(); the allocation, release, and data transmission operations of UMMapper-derived classes in CPU and GPU memory can then be managed automatically by the unified memory.
Further, the custom allocator class in the runtime module is implemented as follows: a custom allocator class is designed to manage the memory allocation and release of vector containers, in order to handle vector containers; by default a vector container uses the allocator space configurator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of unified memory to manage the vector container's memory automatically; in the custom allocator class, the allocate() function is implemented with cudaMallocManaged() and the deallocate() function with cudaFree(); by explicitly calling the custom allocator class when the vector container is declared, the allocation and release of the vector container's memory on the CPU and GPU and the data transmission operations can be managed automatically by the unified memory.
The invention has the following beneficial technical effects:
the system can automatically modify the memory allocation and release statements, automatically insert related pragma primitives and automatically manage data transmission between the CPU and the accelerator on the basis of the OpenMP CPU program, so that the data consistency of the program containing a complex pointer data structure on a heterogeneous computing platform is ensured, and the program performance is improved.
The heterogeneous programming scheme studied in this invention mainly targets the automatic management of complex pointer data structures in OpenMP programs, specifically: automatically modifying memory allocation and release statements, automatically inserting the relevant pragma directives, and automatically managing data transmission between the CPU and the accelerator; this ensures the data consistency of programs on heterogeneous computing platforms and improves program performance.
Since OpenMP Offloading supports accelerator programming, OpenMP code running on the CPU can be offloaded to the GPU for execution. Offloading OpenMP code to the GPU both improves the running efficiency of the program and makes full use of the GPU's acceleration. However, manual offloading cannot guarantee the correctness of the converted program, consumes manpower and material resources, and is very inefficient. The invention therefore provides an automatic management scheme for complex pointer data structures, to solve the hardest problems of the offloading process: the memory allocation and data transmission of complex pointer data structures between the CPU and the accelerator.
Comparison experiments on common benchmark suites (PolyBench, Rodinia, etc.) show that the proposed method can automatically manage complex pointer data structures in OpenMP Offloading programs, ensure program correctness, and improve program performance.
The complex pointer data structure described in the present invention refers to a complex pointer data type.
Drawings
FIG. 1 is the overall block diagram of the system of the present invention; FIG. 2 is a block diagram of the AST analysis method; FIG. 3 is the serial-parallel control flow graph of a function; FIG. 4 is a schematic diagram of unified memory; FIG. 5 is a schematic diagram of the automatic offloading of class-nested pointers (program comparison with and without unified memory); FIG. 6 is a schematic diagram of the automatic offloading of vector containers (program comparison with and without the space configurator); FIG. 7 is a schematic diagram of the automatic offloading of multi-level nested pointers; FIG. 8 is the overall experimental design of the present invention;
FIG. 9 is a histogram of running time at Large data volume on the 2080Ti platform: FIG. 9(a) is the Large-2080Ti running-time comparison (data set a) and FIG. 9(b) is the Large-2080Ti running-time comparison (data set b); FIGS. 9(a) and 9(b) are essentially one figure, split in two because of the number of data sets;
FIG. 10 is the running-time comparison of the Rodinia test set on 2080Ti; FIG. 11 compares the speedup of Polybench-OAO at different data volumes (2080Ti); FIG. 12 compares Polybench speedups (2080Ti); FIG. 13 shows the runtime overhead (2080Ti); FIG. 14 is the detailed test of complex pointer data structures (K40);
FIG. 15 is an example of program conversion containing vector; FIG. 16 is an example of program conversion containing structnest; FIG. 17 is an example of program conversion containing multilevel. In FIGS. 15 to 17: vector is the vector container in C++; structnest is a class-nested pointer; multilevel is a multi-level nested pointer.
Detailed Description
With reference to fig. 1 to 17, the following description is made on an automatic management system for a complex pointer data structure oriented to a heterogeneous platform according to the present invention:
the main task of the invention is to automatically manage the complex pointer data structure (class nesting pointer, vector container, multilevel nesting pointer, etc.) in the OpenMP Offloading program, namely, to realize automatic modification, distribution and release of statements, to automatically manage data transmission, and to ensure data consistency. The invention mainly comprises the following three modules:
the information collection module has two functions: 1) performing static analysis on the source program to collect program information; 2) and establishing an abstract representation, namely a serial-parallel control flow graph, for the source program based on the collected information. The working process of the module comprises the following two steps: 1) generating a corresponding Abstract Syntax Tree (AST) and a Control Flow Graph (CFG) from a C or C + + source code through a Clang compiler, traversing the AST and the CFG, distinguishing a serial-parallel domain, and acquiring detailed program information; 2) and generating a serial-parallel control flow graph according to the information.
The automatic conversion module is mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph to complete the code conversion. It first determines the type of each complex pointer variable from the information stored in the serial-parallel control flow graph, then inserts the appropriate runtime API at the appropriate position in the source code according to that type. The memory allocation, release, and data transmission operations involving complex pointer variables are thereby taken over by the runtime, so that they are managed automatically and data consistency between the CPU and GPU is ensured.
The runtime module is mainly responsible for implementing, on top of unified memory, the following operations for complex pointer data structures: memory allocation and release on the CPU and GPU, and automatic data transmission between the CPU and GPU. The default C++ new, delete, and allocator interfaces are re-implemented with cudaMallocManaged() and cudaFree(), and memory allocation and release are exposed externally as API interfaces.
The result is that the source program is converted into a new program carrying runtime API calls; the overall framework of the system is shown in FIG. 1.
1 Information collection module
The information collection module implements two functions: statically analyzing and collecting program information; and establishing a serial-parallel control flow graph.
1.1 Information collection
First, the Clang compiler is used to perform lexical and semantic analysis on the target program and to generate the Abstract Syntax Tree (AST) and Control Flow Graph (CFG).
Then the following three static analyses are performed on the AST and CFG to obtain function call relation information, variable-related information, and serial/parallel domain information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion.
Analysis one, function call relation analysis: for an AST, the following two steps are performed:
1) recursively traversing each node on the AST;
2) if the current node is a function definition node, saving the function name and information about the sub-functions it calls.
Analysis two, variable information analysis: for an AST, the following three steps are performed:
1) recursively traversing each node on the AST;
2) if the current node involves a variable definition or reference, saving the variable definition, variable reference, and variable scope information;
3) if the current node involves memory allocation or release, saving the memory allocation and memory release information.
Analysis three, serial-parallel domain analysis: for a CFG, the following three steps are performed:
1) recursively traversing each node on the CFG and saving the inter-node relationship information;
2) if the current node is within the scope of an OpenMP #pragma parallel directive, marking it as a parallel node and storing its type information and range information;
3) if the current node is not within the scope of an OpenMP #pragma parallel directive, marking it as a serial node and storing its type information and range information.
Through the three static analyses above, the following information is obtained: function call relation information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial-parallel node type information, serial-parallel node range information, and inter-node relationship information. This information supports the construction of the serial-parallel control flow graph.
1.2 Serial-parallel control flow graph construction
The serial-parallel control flow graph is defined as consisting of serial nodes, parallel nodes, and directed edges between nodes. A serial node represents a code segment that is outside the scope of any OpenMP #pragma parallel directive, contains no branches, and executes serially; the code segment corresponding to a serial node executes on the CPU, and serial nodes are also denoted SEQ nodes.
A parallel node represents a code segment that is within the scope of an OpenMP #pragma parallel directive and executes in parallel; the code segment corresponding to a parallel node is offloaded to the GPU for execution, and parallel nodes are also denoted OMP nodes. Function call information and variable-related information are stored in both serial and parallel nodes. The directed edges between nodes represent the execution order of the corresponding code segments.
The serial-parallel control flow graph is built with the function as the basic processing unit; for the whole source program, once the serial-parallel control flow graph of each function has been built, the graph of the whole program can be built recursively by combining the function call relation information.
For a function, based on the collected program information, a serial-parallel control flow graph can be built by the following steps (a data-structure sketch follows the steps):
1) creating the isolated serial and parallel nodes one by one, using the node type information and node range information;
2) creating directed edges between the nodes, using the inter-node relationship information, to connect the serial and parallel nodes into a graph;
3) storing the function call information, variable definition information, variable reference information, variable scope information, memory allocation information, and memory release information into the corresponding serial or parallel node according to the node range information.
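As a minimal sketch of the data structure just described (the type and field names here are illustrative assumptions, not the patent's code), each node records its kind, source range, and the collected program information, with successor edges giving the execution order:

    // Illustrative C++ sketch of a serial-parallel control flow graph node.
    #include <string>
    #include <vector>

    enum class NodeKind { SEQ, OMP };            // serial node / parallel node

    struct SPNode {
        NodeKind kind;                           // node type information
        int beginLine, endLine;                  // node range information (source span)
        std::vector<std::string> calledFuncs;    // function call information
        std::vector<std::string> varDefs;        // variable definition information
        std::vector<std::string> varRefs;        // variable reference information
        std::vector<std::string> allocs, frees;  // memory allocation/release information
        std::vector<SPNode *> succ;              // directed edges: execution order
    };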
2 Automatic conversion module
The automatic conversion module divides complex pointer data types into three categories: class-nested pointers, vector containers, and multi-level nested pointers.
The key technology in processing complex pointer data structures is Unified Memory (UM), so the concept and principle of unified memory are introduced first. UM maintains a unified memory pool shared between the CPU and GPU; memory is allocated only once, and the resulting data pointer is available to both the host (CPU) and the device (GPU). A single pointer serves both host and device memory, so data transfers between devices are performed automatically by the UM runtime system, which also allows the GPU to process data sets that exceed its memory capacity. The unified memory principle is shown in FIG. 4.
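A minimal CUDA sketch of this principle (illustrative, not taken from the patent): one cudaMallocManaged() allocation yields a pointer that both host and device code can dereference, with the UM runtime migrating the data:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void addOne(int *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1;
    }

    int main() {
        int n = 1024, *v = nullptr;
        cudaMallocManaged(&v, n * sizeof(int));  // one allocation, shared address space
        for (int i = 0; i < n; ++i) v[i] = i;    // written by the CPU
        addOne<<<(n + 255) / 256, 256>>>(v, n);  // used by the GPU, no explicit memcpy
        cudaDeviceSynchronize();                 // wait before the CPU reads again
        printf("v[0] = %d\n", v[0]);
        cudaFree(v);
        return 0;
    }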
Using the unified memory technique, a separate processing method is designed for each of the three complex pointer data types.
2.1 Class-nested pointers
A class-nested pointer is a pointer contained in a class. It is processed by the following steps (see FIG. 5):
1) recursively traversing the serial-parallel control flow graph and, from the variable definition and reference information stored in the serial/parallel nodes, finding the class-nested pointers that are defined in serial nodes and referenced in parallel nodes, together with the classes that contain them;
2) for each class found in step 1), modifying the class definition in the source code so that the class inherits from the UMMapper base class provided by the runtime (see the runtime design);
3) for each class-nested pointer found in step 1), modifying its memory allocation and release statements in the source code, allocating memory with cudaMallocManaged() and releasing it with cudaFree();
4) for each instance of the classes found in step 1), modifying its definition statement in the source code, creating the instance with the overloaded new operator, and passing the memory addresses allocated in step 3) to the corresponding nested pointers inside the instance.
An example of the processing of class nesting pointers is shown in FIG. 5.
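A hedged sketch of the converted code after steps 1)-4) (the class and variable names are illustrative; the UMMapper base class belongs to the runtime, sketched under "3 Runtime" below, and the tool's actual output is the one shown in FIG. 5):

    #include <cuda_runtime.h>

    // Step 2: the class now inherits the runtime base class UMMapper.
    class A : public UMMapper {
    public:
        float *buf;   // the class-nested pointer found in step 1
        int    n;
    };

    void example(int n) {
        float *p = nullptr;
        cudaMallocManaged(&p, n * sizeof(float)); // step 3: nested memory in unified memory
        A *a = new A;      // step 4: overloaded new places the instance in unified memory
        a->buf = p;        // step 4: pass the allocated address to the nested pointer
        a->n = n;
        // ...CPU code and offloaded GPU code can now both dereference a and a->buf...
        cudaFree(a->buf);  // step 3: release with cudaFree()
        delete a;          // overloaded delete frees the instance
    }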
2.2 Vector container
The vector container is the vector container provided by the C++ standard library. It is processed by the following steps:
1) recursively traversing the serial-parallel control flow graph and, from the variable definition and reference information stored in the serial/parallel nodes, finding the vector containers that are defined in serial nodes and referenced in parallel nodes;
2) modifying the definition statements of the vector container instances found in step 1), inserting an explicit call to the custom allocator provided by the runtime (see the runtime design).
A vector container processing example is shown in fig. 6.
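A hedged sketch of the rewritten declaration (following the QY::allocator form visible in FIG. 15; the allocator itself is sketched under "3 Runtime" below):

    #include <vector>

    void example() {
        // before conversion: std::vector<float> v(1024);   // implicit std::allocator
        std::vector<float, QY::allocator<float>> v(1024);   // explicit UM allocator
        // v.data() now lives in unified memory, reachable from CPU and GPU code alike.
    }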
2.3 Multi-level nested pointers
Multi-level nested pointers are pointers with two or more levels of indirection. They are processed by the following steps:
1) recursively traversing the serial-parallel control flow graph and, from the variable definition and reference information stored in the serial/parallel nodes, finding the multi-level nested pointers that are defined in serial nodes and referenced in parallel nodes, together with their sub-pointers at every level;
2) modifying the memory allocation and release statements of all multi-level nested pointers and sub-pointers found in step 1), allocating memory with cudaMallocManaged() and releasing it with cudaFree().
An example of the processing of multi-level nested pointers is shown in FIG. 7.
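A hedged sketch of the converted code after steps 1)-2) for a two-level pointer (variable names are illustrative; compare FIG. 7): every level is allocated with cudaMallocManaged() instead of malloc/new, so the whole pointer chain is reachable from both CPU and GPU:

    #include <cuda_runtime.h>

    void example(int rows, int cols) {
        float **x = nullptr;
        cudaMallocManaged(&x, rows * sizeof(float *));       // level 1: the pointer array
        for (int i = 0; i < rows; ++i)
            cudaMallocManaged(&x[i], cols * sizeof(float));  // level 2: each row buffer
        // ...an offloaded region can now dereference x[i][j] directly...
        for (int i = 0; i < rows; ++i)
            cudaFree(x[i]);                                  // release every level
        cudaFree(x);
    }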
3 Runtime
To keep the converted code close to the format of the source code and to minimize modifications, only a small number of runtime API calls are inserted into the original program, and the unified-memory-based allocation is encapsulated in the runtime.
For classes and their nested pointers, a UMMapper base class responsible for memory allocation and release is implemented in the runtime, as shown in FIG. 5. In UMMapper, the default C++ new and delete operators are overloaded with cudaMallocManaged() and cudaFree(). A class therefore only needs to derive from UMMapper at declaration and create its instances with new; the memory allocation, release, and data transmission operations of the derived class are then managed automatically through unified memory.
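A minimal sketch of such a base class (the patent names the overloaded operators but does not reproduce the code; this body is an assumption consistent with that description):

    #include <cstddef>
    #include <cuda_runtime.h>

    class UMMapper {
    public:
        void *operator new(std::size_t size) {
            void *p = nullptr;
            cudaMallocManaged(&p, size);   // the object itself lives in unified memory
            return p;
        }
        void operator delete(void *p) { cudaFree(p); }
        void *operator new[](std::size_t size) {
            void *p = nullptr;
            cudaMallocManaged(&p, size);
            return p;
        }
        void operator delete[](void *p) { cudaFree(p); }
        virtual ~UMMapper() = default;     // safe polymorphic deletion of derived classes
    };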
The vector container uses the allocator space configurator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of unified memory to manage the vector container's memory automatically. In the custom allocator class, the allocate() function is implemented with cudaMallocManaged() and the deallocate() function with cudaFree(). By explicitly calling the custom allocator class when the vector container is declared, the allocation and release of the vector container's memory on the CPU and GPU and the data transmission operations are managed automatically by unified memory, as shown in FIG. 6.
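A minimal sketch of such an allocator (the patent specifies only that allocate() and deallocate() wrap cudaMallocManaged() and cudaFree(); the remaining members follow the standard allocator requirements, and the QY namespace matches the converted code in FIG. 15):

    #include <cstddef>
    #include <cuda_runtime.h>

    namespace QY {
    template <class T>
    struct allocator {
        using value_type = T;
        allocator() = default;
        template <class U> allocator(const allocator<U> &) {}

        T *allocate(std::size_t n) {
            void *p = nullptr;
            cudaMallocManaged(&p, n * sizeof(T));  // vector storage in unified memory
            return static_cast<T *>(p);
        }
        void deallocate(T *p, std::size_t) { cudaFree(p); }
    };
    template <class T, class U>
    bool operator==(const allocator<T> &, const allocator<U> &) { return true; }
    template <class T, class U>
    bool operator!=(const allocator<T> &, const allocator<U> &) { return false; }
    } // namespace QY

With this in place, a declaration such as std::vector<float, QY::allocator<float>> v(n); keeps its storage in unified memory.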
The technical effects of the invention are verified as follows:
to verify and analyze the effect of the code off-loading scheme, we performed tests on the RTX 2080Ti platform, using a benchmarking dataset such as PolyBench, Rodinia, etc., comparing the test results with another well-known source-to-source compiler (DawnCC) and Manual translation (Manual) results and CPU-parallel code. Our scheme is named OAO (OpenMP Automated advertising with complete data structure support).
Collecting the experimental results involves four stages: integration and execution, script runs, Python extraction, and OriginLab analysis. The overall experimental process is shown in FIG. 8.
1 Running time
The running times at Large data volume on the 2080Ti platform are shown in FIG. 9.
The running-time comparison of the Rodinia test set on 2080Ti is shown in FIG. 10.
From the collected running times and the comparison across data sets, the following conclusions can be drawn:
(1) As the running-time figures show, OAO handles all 23 test programs, whereas DawnCC handles only 15. OAO improves performance on 9 programs on the K40 platform and 15 programs on the 2080Ti platform; relative to manual conversion, OAO performs best on all test programs on all platforms.
(2) For Rodinia on 2080Ti, the data volumes were determined by pre-experimental tests; of the 8 data sets in total, 5 show a shorter running time for OAO than for OMP.
(3) For Polybench at Large data volume, the running time of OAO is better than DawnCC, Manual, and native; overall the running times follow the trend OAO < DawnCC < Manual < native. Native has the longest running time because of its choice of transmission statements and the absence of parallel optimization; Manual contains redundant transmission of some data, so its running time is longer than OAO's; and OAO is also generally superior to DawnCC in handling the data sets.
2 Speedup
As FIG. 11 shows, among the results greater than 1x, the speedup at Large data volume exceeds that at Medium, which is consistent with the nature of parallel acceleration: the larger the data volume processed, the more pronounced the parallel speedup.
Speedup is computed as the ratio OMP/Type, where Type is one of the other versions; with 1x as the boundary, results above it are speedups and results below it are slowdowns. As FIG. 12 shows, on every platform and across all test programs the speedup of OAO is better than Manual, and on the Polybench programs OAO is slightly better than DawnCC. Since DawnCC cannot process the Rodinia programs, OAO has the best data-transmission optimization across all test programs, can handle complex programs that DawnCC cannot, and has a wider range of application.
3 Runtime overhead
For the Polybench test set, the runtime overhead is shown in FIG. 13.
Compared with the total running time, the runtime overhead of every test program is smaller by more than two orders of magnitude; the overhead of the OAO runtime is therefore small and almost negligible, and adding the runtime does not cost the source program any performance.
4 Unified memory
The detailed tests of unified memory cover the vector container, class-nested pointers (structnest), and multi-level nested pointers (multilevel). The running times and data transfer times of the complex pointer data structures are compared in FIG. 14.
From the data-transfer perspective, the H2D and D2H times are far smaller than the total running time, and all complex pointer data structures are processed correctly, so the OAO compiler correctly supports GPU offloading of complex pointer data structures.
Simple conversion examples for the vector container, structnest, and multilevel are shown in FIG. 15, FIG. 16, and FIG. 17.
In FIG. 15, the vector variable's declaration is rewritten to the QY::allocator form. In FIG. 16, the nested pointers x, y, z inside the class are allocated with cudaMallocManaged() at declaration, and the instantiated class a is marked by a transmission statement, adding map(from: a[0:1]). In FIG. 17, x is a two-level pointer and y is the first-level pointer of x; at declaration, x and y are allocated with cudaMallocManaged().
Table 6: Unified memory test results
[Table 6 appears as an image in the original publication.]
The unified memory test results are shown in Table 6. During data transmission, a preliminary transfer precedes the formal one: 4 B host-to-device (H2D) and 1 B device-to-host (D2H), the "4+1" mode. This is why, in the table, the bidirectional transfer counts for vector (52 vs. 49) and for structnest (36 vs. 33) each differ by 3. In addition, vector is transmitted only twice (X and Y), 24 B each time, while structnest is transmitted once (a), with a data size of 32 B.
For multilevel, z is a one-way transfer that uses the data transmission model, while x and y are multi-level nested pointers that use unified memory. Apart from the preliminary "4+1" transfer, the difference between the H2D and D2H transfer volumes arises because z is not returned while x is.
The test results prove that OAO correctly handles complex pointer data structures through unified memory, realizing full support for their processing.

Claims (10)

1. An automatic management system for complex pointer data structures oriented to heterogeneous platforms, characterized in that the system is used to automatically manage complex pointer data structures in OpenMP Offloading programs on heterogeneous computing platforms and to ensure data consistency;
the system comprises the following three modules:
the information collection module has two functions: 1) statically analyzing the source program to collect program information; 2) building an abstract representation of the source program, namely a serial-parallel control flow graph, based on the collected information;
the module works in two steps: 1) generating the corresponding abstract syntax tree AST and control flow graph CFG from the C or C++ source code with the Clang compiler, traversing the AST and CFG, distinguishing serial and parallel domains, and collecting detailed program information; 2) generating the serial-parallel control flow graph from this information;
the automatic conversion module is mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph to complete the code conversion; it first determines the complex pointer data type from the complex pointer variable information stored in the serial-parallel control flow graph, and then, according to the type, inserts the appropriate runtime API at the appropriate position in the source code; in this way, the memory allocation, release, and data transmission operations involving complex pointer variables are all taken over by the runtime, so that complex pointer variables are managed automatically and data consistency between the CPU and the GPU is ensured;
the runtime module is mainly responsible for implementing, based on unified memory, the following operations for complex pointer data types: memory allocation and release on the CPU and GPU, and automatic data transmission between the CPU and GPU; the runtime module consists of a UMMapper class and an allocator class, which expose memory allocation and release interfaces externally in the form of APIs.
2. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 1, characterized in that the information collection function of the information collection module is implemented as follows:
firstly, the Clang compiler is used to perform lexical and semantic analysis on the target program and to generate the abstract syntax tree AST and control flow graph CFG;
then, the following three static analyses are performed on the AST and CFG to obtain function call relation information, variable-related information, and serial/parallel domain information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion;
analysis one, function call relation analysis: for an AST, the following two steps are performed:
1) recursively traversing each node on the AST;
2) if the current node is a function definition node, saving the function name and information about the sub-functions it calls;
analysis two, variable information analysis: for an AST, the following three steps are performed:
1) recursively traversing each node on the AST;
2) if the current node involves a variable definition or reference, saving the variable definition, variable reference, and variable scope information;
3) if the current node involves memory allocation or release, saving the memory allocation and memory release information;
analysis three, serial-parallel domain analysis: for a CFG, the following three steps are performed:
1) recursively traversing each node on the CFG and saving the inter-node relationship information;
2) if the current node is within the scope of an OpenMP #pragma parallel directive, marking it as a parallel node and storing its type information and range information;
3) if the current node is not within the scope of an OpenMP #pragma parallel directive, marking it as a serial node and storing its type information and range information;
through the three static analyses above, the following information is obtained: function call relation information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial-parallel node type information, serial-parallel node range information, and inter-node relationship information; this information supports the construction of the serial-parallel control flow graph.
3. The automatic management system for the complex pointer data structure oriented to the heterogeneous platform according to claim 1 or 2, characterized in that the serial-parallel control flow graph in the information collection module is defined as follows:
the serial-parallel control flow graph is defined as consisting of serial nodes, parallel nodes, and directed edges between nodes; a serial node represents a code segment that is outside the scope of any OpenMP #pragma parallel directive, contains no branches, and executes serially; the code segment corresponding to a serial node executes on the CPU, and serial nodes are also denoted SEQ nodes;
a parallel node represents a code segment that is within the scope of an OpenMP #pragma parallel directive and executes in parallel; the code segment corresponding to a parallel node is offloaded to the GPU for execution, and parallel nodes are also denoted OMP nodes;
function call information and variable-related information are stored in both serial and parallel nodes;
the directed edges between nodes represent the execution order of the corresponding code segments.
4. The system according to claim 3, wherein the process of establishing the serial-parallel control flow graph in the information collection module comprises:
the serial-parallel control flow graph is built with the function as the basic processing unit; for the whole source program, once the serial-parallel control flow graph of each function has been built, the graph of the whole program can be built recursively by combining the function call relation information collected in claim 2;
for a function, based on the information collected in claim 2, a serial-parallel control flow graph can be built by the following steps:
1) creating the isolated serial and parallel nodes one by one, using the node type information and node range information;
2) creating directed edges between the nodes, using the inter-node relationship information, to connect the serial and parallel nodes into a graph;
3) storing the function call information, variable definition information, variable reference information, variable scope information, memory allocation information, and memory release information into the corresponding serial or parallel node according to the node range information.
5. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 4, characterized in that the complex pointer data is divided into three types according to the position of the pointer:
class-nested pointers, vector containers, and multi-level nested pointers;
a class-nested pointer is a pointer contained in a class; a vector container is the vector container provided by the C++ standard library; multi-level nested pointers are pointers with two or more levels of indirection.
6. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 5, characterized in that the class-nested pointer is processed by the following steps:
1) recursively traversing the serial-parallel control flow graph established in claim 4 and, from the variable definition and reference information stored in the serial/parallel nodes, finding the class-nested pointers that are defined in serial nodes and referenced in parallel nodes, together with the C++ classes that contain them;
2) for each C++ class found in step 1), modifying the class definition in the source code so that the class inherits from the UMMapper base class provided by the runtime;
3) for each class-nested pointer found in step 1), modifying its memory allocation and release statements in the source code, allocating memory with cudaMallocManaged() and releasing it with cudaFree();
4) for each instance of the C++ classes found in step 1), modifying its definition statement in the source code, creating the instance with the overloaded new operator, and passing the memory addresses allocated in step 3) to the corresponding nested pointers inside the instance.
7. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 5, characterized in that the vector container is processed by the following steps:
1) recursively traversing the serial-parallel control flow graph established in claim 4 and, from the variable definition and reference information stored in the serial/parallel nodes, finding the vector containers that are defined in serial nodes and referenced in parallel nodes;
2) modifying the definition statements of the vector container instances found in step 1), inserting an explicit call to the custom allocator provided by the runtime.
8. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 5, characterized in that the multi-level nested pointer is processed by the following steps:
1) recursively traversing the serial-parallel control flow graph established in claim 4 and, from the variable definition and reference information stored in the serial/parallel nodes, finding the multi-level nested pointers that are defined in serial nodes and referenced in parallel nodes, together with their sub-pointers at every level;
2) modifying the memory allocation and release statements of all multi-level nested pointers and sub-pointers found in step 1), allocating memory with cudaMallocManaged() and releasing it with cudaFree().
9. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 6, characterized in that the UMMapper class in the runtime module is implemented as follows:
a UMMapper class is designed to manage the memory allocation and release of the C++ classes, in order to handle class-nested pointers; inside the UMMapper class, the default C++ new and delete operators are overloaded with cudaMallocManaged() and cudaFree(); the allocation, release, and data transmission operations of UMMapper-derived classes in CPU and GPU memory can then be managed automatically by the unified memory.
10. The automatic management system for complex pointer data structures oriented to heterogeneous platforms according to claim 7, characterized in that the custom allocator class in the runtime module is implemented as follows:
a custom allocator class is designed to manage the memory allocation and release of vector containers, in order to handle vector containers; by default a vector container uses the allocator space configurator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of unified memory to manage the vector container's memory automatically; in the custom allocator class, the allocate() function is implemented with cudaMallocManaged() and the deallocate() function with cudaFree(); by explicitly calling the custom allocator class when the vector container is declared, the allocation and release of the vector container's memory on the CPU and GPU and the data transmission operations can be managed automatically by the unified memory.
CN202010971038.9A 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure Active CN112083956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010971038.9A CN112083956B (en) 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010971038.9A CN112083956B (en) 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure

Publications (2)

Publication Number Publication Date
CN112083956A true CN112083956A (en) 2020-12-15
CN112083956B CN112083956B (en) 2022-12-09

Family

ID=73736379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010971038.9A Active CN112083956B (en) 2020-09-15 2020-09-15 Heterogeneous platform-oriented automatic management system for complex pointer data structure

Country Status (1)

Country Link
CN (1) CN112083956B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6725448B1 (en) * 1999-11-19 2004-04-20 Fujitsu Limited System to optimally create parallel processes and recording medium
CN101963918A (en) * 2010-10-26 2011-02-02 上海交通大学 Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform
CN102707952A (en) * 2012-05-16 2012-10-03 上海大学 User description based programming design method on embedded heterogeneous multi-core processor
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN106874113A (en) * 2017-01-19 2017-06-20 国电南瑞科技股份有限公司 A kind of many GPU heterogeneous schemas static security analysis computational methods of CPU+
CN106940654A (en) * 2017-02-15 2017-07-11 南京航空航天大学 The automatic detection and localization method of EMS memory error in source code
CN110383247A (en) * 2017-04-28 2019-10-25 伊纽迈茨有限公司 Method, computer-readable medium and heterogeneous computing system performed by computer
CN108536581A (en) * 2018-03-08 2018-09-14 华东师范大学 Formalization verification method and system when a kind of operation for source code
CN109933327A (en) * 2019-02-02 2019-06-25 中国科学院计算技术研究所 OpenCL compiler method and system based on code fusion compiler framework
CN111611158A (en) * 2020-05-08 2020-09-01 中国原子能科学研究院 Application performance analysis system and method

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
WANG FARUI et al.: "Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading", Journal of Supercomputing *
LIU Xiaoxian: "Research on Parallel Compilation Optimization Technology for Shared-Memory Architectures", China Doctoral Dissertations Full-text Database, Information Science and Technology *
SUN Dalin; TANG Haoxuan: "Parallel Simplification of Road-Surface Point Clouds", Intelligent Computer and Applications *
SUN Shouhang: "OpenMP Compilation and Optimization for Heterogeneous Multi-core Processors", China Master's Theses Full-text Database, Information Science and Technology *
LI Yanbing et al.: "A Parallel Compilation Framework for Heterogeneous Many-core Processors", Journal of Software *
JIANG Xia et al.: "Automatic Translation and Optimization of Parallel Programs from OpenACC to the MIC Platform", Journal of Chinese Computer Systems *
GUO Haonan: "Automatic Offloading and Optimization of OpenMP Programs for Heterogeneous Platforms", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023197397A1 (en) * 2022-04-13 2023-10-19 堡垒科技有限公司 Decentralized trusted tokenization protocol for open source software

Also Published As

Publication number Publication date
CN112083956B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
US8949805B2 (en) Processing method
US7912877B2 (en) Leveraging garbage collection to dynamically infer heap invariants
EP1519273B1 (en) Region-based memory management for object-oriented programs
US8065668B2 (en) Unified data type system and method
US8707278B2 (en) Embedding class hierarchy into object models for multiple class inheritance
US10452409B2 (en) Universal adapter for native calling
JPH06103075A (en) Operation for object-oriented application
Barik et al. Communication optimizations for distributed-memory X10 programs
Vollmer et al. Compiling tree transforms to operate on packed representations
CN112083956B (en) Heterogeneous platform-oriented automatic management system for complex pointer data structure
US20020062478A1 (en) Compiler for compiling source programs in an object-oriented programming language
Calvert Parallelisation of java for graphics processors
CN111966397A (en) Automatic transplanting and optimizing method for heterogeneous parallel programs
JP7140935B1 (en) Deterministic memory allocation for real-time applications
Chambers et al. Iterative type analysis and extended message splitting: Optimizing dynamically-typed object-oriented programs
Wroblewski et al. Accelerating Spark Datasets by inlining deserialization
Qiu Programming language translation
Skjellum et al. Object‐oriented analysis and design of the Message Passing Interface
Abbas et al. Object oriented parallel programming
Squyres et al. The design and evolution of the MPI-2 C++ interface
Pettersson et al. 9 Simulating Tailcalls in C
Pachev GPUMap: A Transparently GPU-Accelerated Map Function
Benoit et al. Runtime support for automatic placement of workloads on heterogeneous processors
Noha An LLVM front end for the Scheme language
Fredriksson Distributed call-by-value machines

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant