CN114398039A - Automatic fine-grained two-stage parallel translation method - Google Patents

Automatic fine-grained two-stage parallel translation method

Info

Publication number
CN114398039A
CN114398039A (application CN202111464906.5A)
Authority
CN
China
Prior art keywords
loop
dependency relationship
loop structure
iteration
dependency
Prior art date
Legal status (assumed, not a legal conclusion)
Pending
Application number
CN202111464906.5A
Other languages
Chinese (zh)
Inventor
刘金硕
黄朔
邓娟
刘宁
王晨阳
唐浩洲
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111464906.5A priority Critical patent/CN114398039A/en
Publication of CN114398039A publication Critical patent/CN114398039A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/43 Checking; Contextual analysis
    • G06F 8/433 Dependency analysis; Data or control flow analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention provides an automatic fine-grained two-stage parallel translation method. The source C code is first parsed with ANTLR, which automatically generates an EBNF grammar description and the corresponding lexer and parser. The loop information extracted by the parser is then analyzed: if flow dependencies are found, the loop statements containing them are not parallelizable; if anti-dependencies or output dependencies between data are found, those dependencies are eliminated. A loop statement with no remaining data dependencies is parallelizable. The parallelizable loop structures are then mapped to structures suitable for CUDA and CPU multithreaded execution, and the corresponding CUDA code and CPU multithreaded code are generated.

Description

Automatic fine-grained two-stage parallel translation method
Technical Field
The invention relates to the technical field of computers, in particular to an automatic fine-grained two-stage parallel translation method.
Background
Since NVIDIA released the GeForce 256 graphics processing chip in 1999 and introduced the GPU concept, the GPU has become one of the main accelerator options in current high-performance computing systems; owing to its powerful computing capability, flexible programmability and low power consumption, it is widely used for compute-intensive programs. For the same task, running a parallel program on a GPU can greatly reduce the running time compared with executing a serial program on a CPU, and the advantage of GPU parallel computing is especially pronounced when processing large data.
Current parallel programming methods include MPI, OpenCL and OpenMP, but manually or semi-automatically converting the large body of existing serial programs into parallel programs remains a significant challenge. Some automatic translation tools are also available: for example, PPCG is a source-to-source compiler that uses a multi-level tiling strategy with polyhedral parallel code generation for C-to-CUDA conversion, and Bones is a skeleton-based source-to-source automatic parallelization method that converts C into five types of target code.
Existing translation tools or methods can only convert a serial program into a single parallel form, such as CPU multithreading or GPU code. However, even though the GPU has excellent acceleration capability, the CPU must sit idle waiting for the GPU to complete its computing tasks, wasting the computational resources of the CPU.
Disclosure of Invention
The invention provides an automatic fine-grained two-stage parallel translation method, which is used for solving or at least partially solving the technical problem of low calculation efficiency in the prior art.
In order to solve the technical problem, the invention provides an automatic fine-grained two-stage parallel translation method, which comprises the following steps:
s1: analyzing the serial source code with a preset parser to generate an abstract syntax tree corresponding to the source code, then traversing the abstract syntax tree and extracting the loop structures;
s2: analyzing the data dependencies in the extracted loop structure and judging whether a data dependency exists, wherein the data dependencies comprise flow dependencies, anti-dependencies and output dependencies; if no data dependency exists, directly parallelizing the loop structure; if a data dependency exists, processing according to its type, specifically comprising: when the data dependency is a flow dependency, marking the loop structure as not parallelizable; when the data dependency is an anti-dependency or an output dependency, processing the loop structure with the array privatization technique, wherein processing the loop structure with the array privatization technique comprises: localizing the storage units corresponding to the loop iterations in the loop structure, thereby eliminating the anti-dependencies and output dependencies caused by variable reuse;
s3: mapping the parallelizable loop structure to a CUDA and CPU multithreaded execution structure to generate the corresponding CUDA code and CPU multithreaded code, wherein the CPU creates a corresponding number of threads: one thread is responsible for GPU scheduling and the other threads execute the parallel tasks assigned to the CPU, while the GPU executes the tasks assigned to it under that scheduling.
In one embodiment, step S1 includes:
s1.1: creating an EBNF description of the serial source code using an ANTLR tool;
s1.2: performing lexical analysis: matching the characters of the serial source code against the EBNF description, masking or filtering irrelevant content, and generating tokens for syntax analysis;
s1.3: performing syntax analysis: analyzing the generated tokens and generating the abstract syntax tree corresponding to the source code;
s1.4: traversing the generated abstract syntax tree and extracting the loop structures, wherein a loop structure includes the loop nesting level and the loop-related variable information, and the loop-related variable information includes the variable name, the variable type and the line number recording the loop position.
In one embodiment, step S1.3 adds rule parameters to enable the transfer of context information when performing the parsing.
In one embodiment, analyzing the data dependencies in the extracted loop structure to judge whether a data dependency exists comprises:
if a storage unit in the current loop structure is written in one iteration and then read in a subsequent iteration, a flow dependency exists in the loop structure;
if a storage unit in the current loop structure is read in one iteration and then written in a subsequent iteration, an anti-dependency exists in the loop structure;
if a storage unit in the current loop structure is written in one iteration and then written again in a subsequent iteration, an output dependency exists in the loop structure.
The technical solutions in the embodiments of the present application have at least the following technical effects:
The invention provides an automatic fine-grained two-stage parallel translation method: a preset parser first analyzes the serial source code; the data dependencies in the extracted loop structure are then analyzed to judge whether a data dependency exists, and when the dependency is an anti-dependency or an output dependency it is eliminated with the array privatization technique; finally, the parallelizable loop structure is mapped to a CUDA and CPU multithreaded execution structure. In this way the source C code is automatically translated into code for both the multithreaded CPU and the GPU: the translation result is divided into C code on the CPU and CUDA code on the GPU, which execute in parallel. The CPU acts as the host, responsible for serial computation such as control logic and transaction processing; the GPU acts as the coprocessor or device end, responsible for large-scale data-parallel computation with high computational density and simple logic branches, so that computational efficiency is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of an automatic fine-grained two-stage parallel translation method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a code fragment using rule parameters according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the declaration of a loop statement using a return value in an embodiment of the present invention.
FIG. 4 is a diagram illustrating an embodiment of the present invention being assigned a new value in a current iteration and then used in a next iteration.
FIG. 5 is a diagram illustrating an embodiment of the present invention in which an array is assigned outside a loop and reused in an iteration.
FIG. 6 is a schematic diagram of the privatization of the array in FIG. 5 according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the first allocation of an array and the subsequent reuse in the same iteration according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the privatization of the array shown in FIG. 7 according to an embodiment of the present invention.
FIG. 9 is a simplified mapping template in accordance with an embodiment of the present invention.
FIG. 10 is a flow chart of the parser operation in an embodiment of the present invention.
Detailed Description
The invention relates to an automatic fine-grained two-stage parallel translation method for big-data task processing. From the EBNF description, ANTLR generates the corresponding lexer and parser for building the abstract syntax tree. The loop information extracted by the parser is analyzed: if a flow dependency is found, the loop statement containing it is marked as not parallelizable; if anti-dependencies or output dependencies between data are found, the loop structure is processed to eliminate them; a loop statement with no data dependencies is parallelizable. To eliminate the anti-dependencies and output dependencies caused by variable reuse, the storage units involved in each loop iteration are localized using the array privatization technique, so that interaction with the storage units of other loop iterations is cut off. The parallelizable loop structures (including loop structures without data dependencies and loop structures whose anti-dependencies and output dependencies have been eliminated) are mapped to structures suitable for CUDA and CPU multithreaded execution, and the corresponding CUDA code and CPU multithreaded code are then generated.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an automatic fine-grained two-stage parallel translation method, which comprises the following steps:
s1: analyzing the serial source code with a preset parser to generate an abstract syntax tree corresponding to the source code, then traversing the abstract syntax tree and extracting the loop structures;
s2: analyzing the data dependencies in the extracted loop structure and judging whether a data dependency exists, wherein the data dependencies comprise flow dependencies, anti-dependencies and output dependencies; if no data dependency exists, directly parallelizing the loop structure; if a data dependency exists, processing according to its type, specifically comprising: when the data dependency is a flow dependency, marking the loop structure as not parallelizable; when the data dependency is an anti-dependency or an output dependency, processing the loop structure with the array privatization technique, wherein processing the loop structure with the array privatization technique comprises: localizing the storage units corresponding to the loop iterations in the loop structure, thereby eliminating the anti-dependencies and output dependencies caused by variable reuse;
s3: mapping the parallelizable loop structure to a CUDA and CPU multithreaded execution structure to generate the corresponding CUDA code and CPU multithreaded code, wherein the CPU creates a corresponding number of threads: one thread is responsible for GPU scheduling and the other threads execute the parallel tasks assigned to the CPU, while the GPU executes the tasks assigned to it under that scheduling.
In a specific implementation process, the preset parser is ANTLR, an open-source parser generator that can automatically build a syntax tree from its input and display it visually. ANTLR (ANother Tool for Language Recognition), whose predecessor is PCCTS, provides a framework for automatically constructing recognizers, compilers and interpreters of custom languages from grammatical descriptions, for languages including Java, C++ and C#.
The automatic fine-grained two-stage parallel translation method provided by the invention comprises three parts: parsing the source C code with ANTLR, analyzing the data dependencies, and eliminating anti-dependencies and mapping; see FIG. 1.
Step S1 parses the source C code through ANTLR: the source code is scanned, and an Extended Backus-Naur Form (EBNF) grammar description can then be generated automatically. From the EBNF description, ANTLR generates the corresponding lexer and parser for building the Abstract Syntax Tree (AST).
Step S2 analyzes the loop structures (with their loop information) extracted by the parser. If flow dependencies are found, the loop statements containing them are not parallelizable. If anti-dependencies or output dependencies between data are found, the loop structure is processed to eliminate them; a loop statement with no data dependencies is parallelizable. When anti-dependencies and output dependencies between data exist, those caused by variable reuse are eliminated, and the storage units involved in each loop iteration are localized with the array privatization technique so that interaction with the storage units of other iterations is cut off.
Step S3 maps the parallelizable loop structures to structures suitable for CUDA and CPU multithread execution, and then generates corresponding CUDA code and CPU multithread code. The multicore CPU creates a corresponding number of threads: one thread is responsible for GPU scheduling and the other threads perform parallel tasks assigned to the CPU. At the same time, the GPU performs the tasks assigned to it.
In one embodiment, step S1 includes:
s1.1: creating an EBNF description of the serial source code using an ANTLR tool;
s1.2: performing lexical analysis: matching the characters of the serial source code against the EBNF description, masking or filtering irrelevant content, and generating tokens for syntax analysis;
s1.3: performing syntax analysis: analyzing the generated tokens and generating the abstract syntax tree corresponding to the source code;
s1.4: traversing the generated abstract syntax tree and extracting the loop structures, wherein a loop structure includes the loop nesting level and the loop-related variable information, and the loop-related variable information includes the variable name, the variable type and the line number recording the loop position.
Wherein, when the syntax analysis is executed in step S1.3, rule parameters are added to implement the transfer of context information.
In the specific implementation process, analyzing the source C code through ANTLR mainly includes the following contents:
ANTLR is a tool for automatically generating language recognizers. It produces a grammatical description of the source C code using EBNF rules, performs lexical and syntax analysis on the source program according to the grammar attributes, and then generates an AST (abstract syntax tree). ANTLR provides a mechanism for traversing the AST, which helps extract loop-related information. The invention uses ANTLR to parse the source C code, generate the AST and extract the loop-related information. The workflow of the parser is shown in FIG. 10.
First, step S1.1 is performed: an EBNF description of the serial source code is created using ANTLR. A grammar described in EBNF can be represented by a quadruple:
G(Z) = (Vn, Vt, S, P)    (1)
where Vn is the finite set of non-terminal symbols; Vt is the finite set of terminal symbols; S is the start symbol of the grammar; P is the finite set of productions (the rule set); and Z denotes the source code. The most important part of the grammar is P, the production set, whose rules take the form "A : a". The capital letter A on the left-hand side of a production denotes a non-terminal symbol; the lower-case a on the right-hand side may contain both non-terminal and terminal symbols.
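As a toy illustration of the quadruple (the grammar below is invented for this sketch and is not from the patent), a grammar whose language is the strings of one or more a's can be written as:

```latex
% Toy instance of G(Z) = (Vn, Vt, S, P): one non-terminal S, one
% terminal a, start symbol S, and two productions of the form "A : a".
G = (\{S\},\ \{a\},\ S,\ \{\, S \to a\,S,\ \ S \to a \,\})
```

Here "S : a S" and "S : a" are the two entries of P, matching the "A : a" production shape described above.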
Then step S1.2 is performed, where lexical analysis is performed, matching characters in the input stream, masking or filtering out irrelevant content, and generating tokens for syntactic analysis. To achieve this, ANTLR adds a series of filtering methods to the lexical grammar.
In source code, characters such as spaces, tabs, carriage returns, and line breaks are usually meaningless redundant characters. ANTLR provides a skip () method to skip these meaningless symbols. For example, use
WS : (' ' | '\t' | '\n' | '\r')+ { skip(); };
When these characters are encountered, the skip() method is called to skip them. Comments in the source code are meaningless at compile time but may need to be retained when generating the final document. ANTLR provides a channel mechanism to hide comments at compile time. For example, using
COMMENT : '/*' .* '*/' { $channel = HIDDEN; };
Matching comment blocks can be placed in the HIDDEN channel without appearing in subsequent parsing.
Then step S1.3 is performed: syntax analysis is carried out on the tokens from the previous step. By default, ANTLR works with context-free rules. Preferably, the invention adds rule parameters to pass context information, making up for the deficiency of the context-free grammar. For example, in FIG. 2 the code snippet determines whether the type of a variable assignment meets requirements; in the variable declaration syntax, a rule parameter idList[type is added. To extract loop-related variable information, the invention adds a return value to the syntax expression of each loop statement so that variable-related information can be extracted directly from the various loop statements. FIG. 3 shows an example of adding a return value int to a while statement declaration.
The AST is formed after parsing. It stores the data structure of the program in the form of a tree, with each node representing a construct in the source code. Traversing the abstract syntax tree of the serial C code serves mainly to visit the loop structures. In step S1.4, the AST is traversed using the Visitor mechanism provided by ANTLR. The invention overrides the visitForStatement() method, which stores the loop nesting level and the loop-related variable information of the source C code, including the variable name, the variable type and the line number recording the loop position. The collected variable information is then handed to the next stage for processing.
In one embodiment, analyzing the extracted data dependency relationship in the loop structure to determine whether a data dependency relationship exists includes:
if one storage unit in the current loop structure is written in one iteration and then read in the subsequent iteration, the stream dependency relationship exists in the loop structure;
if a storage unit is read in one iteration in the current loop structure and then written in a storage unit in the subsequent iteration, the fact that the anti-dependency relationship exists in the loop structure is shown;
if a storage unit in the current loop structure is written in one iteration and then written again in the subsequent iteration, the output dependency relationship in the loop structure is shown to exist.
In particular, dependencies refer to partial order relationships of statements in a program that reflect the inherent order required to maintain the semantics of the program. The parallelism of the program is influenced by reading and writing access to data, so the dependence to be considered in parallel conversion is data dependence.
According to the read and write operations on the same memory area, data dependencies fall into flow dependencies, anti-dependencies and output dependencies. In a loop structure, a flow dependency means that a storage unit is written in one iteration and then read in a subsequent iteration; an anti-dependency means that a storage unit is read in one iteration and then written in a subsequent iteration; an output dependency means that a storage unit is written in one iteration and then written again in a subsequent iteration.
In the specific implementation, assume that in a loop statement F (a loop structure), I is the iteration space and i (i ∈ I) is the loop control variable of one iteration in I. Under iteration i, Read_i denotes the set of all variables read and Write_i denotes the set of all variables written. A sufficient condition for F to be parallelizable is then:
(Read_i ∩ Write_j) ∪ (Write_i ∩ Read_j) ∪ (Write_i ∩ Write_j) = ∅    (2)
where i, j ∈ I and i ≠ j. Equation (2) states that the loop structure has no flow dependency, no anti-dependency, and no output dependency.
If a flow dependency exists in F, the following condition is satisfied:
Write_j ∩ Read_i ≠ ∅    (3)
where i, j ∈ I and i > j. A write to a memory region precedes a read of the same region, similar to the relationship between producer and consumer. Loop structures containing flow dependencies cannot be executed in parallel on a GPU.
If an anti-dependency exists in F, the following condition is satisfied:
Read_k ∩ Write_l ≠ ∅    (4)
where k, l ∈ I and k < l. A read of a memory region occurs before a write to it, owing to repeated references to the same region; automatic parallel translation can be achieved by creating a temporary storage area.
If an output dependency exists in F (the loop structure F contains output dependencies), the following condition is satisfied:
Write_m ∩ Write_n ≠ ∅    (5)
where m, n ∈ I and m ≠ n. The same storage area is written at least twice. For both anti-dependencies and output dependencies, automatic parallel conversion can be achieved by creating a temporary storage area.
Specifically, the loop-related information extracted by the parser is taken as input and subjected to data dependency analysis. If anti-dependencies or output dependencies between data are found, the loop structures containing them are passed to the next stage, loop array privatization. If a flow dependency is found, the loop structure is marked as not parallelizable. If no data dependency is found, the loop structure can be parallelized directly.
In serial C code, reuse of the same variables is a major obstacle to automatic parallel conversion: reuse of memory addresses produces flow dependencies, anti-dependencies and output dependencies. The invention designs a loop array privatization stage (the array privatization technique) to eliminate the removable dependencies. The explicit representations of storage units in loop statements are variables and arrays; a variable can be regarded as a special case of an array with a single element. In serial C code, a global array is typically used to store data to reduce memory usage, and this global array is then used in every iteration of the loop statement. Loop array privatization gives each iteration a new, private storage space in place of the original reused space, so that no cross-iteration dependencies remain.
Apart from array initialization, the positions where an array is reassigned in a loop statement fall into three categories. In the first, the array is assigned a new value in the current iteration and then used in the next iteration. In the second, the array is assigned outside the loop and reused in the iterations. In the third, the array is assigned first and then reused within the same iteration.
FIG. 4 shows an example of the first category. In this loop statement, when i == 0, the value "temp2 + 2" is assigned to array A in the fifth line; when i == 1, array A is used in the fourth line in the statement "temp1 = A + 1". This is a loop with a flow dependency, so it cannot be translated by loop array privatization.
FIG. 5 shows an example of the second category. In this code segment, array A is assigned the value "1" outside the loop and used in the fifth line of the loop. In this case array A may be privatized, as shown in FIG. 6.
FIG. 7 shows an example of the third category. In this loop statement, when i == 0, array A is first assigned "temp2 + 2" in the fourth line and then used in the statement "temp1 = A + 1" in the fifth line; the array is assigned and reused within the same iteration. In this case array A may be privatized, as shown in FIG. 8.
The first condition for privatizing a loop statement is that there are no flow dependencies between iterations. The second condition is that the array is reassigned outside the loop, or before its use within the same iteration. Combining the conditions for anti-dependency and output dependency, the necessary conditions for array privatization are:
Write_j ∩ Read_i = ∅    (6)
Read_k ∩ Write_l ≠ ∅    (7)
Write_m ∩ Write_n ≠ ∅    (8)
where i, j, k, l, m, n ∈ I and i > j, l > k, m ≠ n, with at least one of (7) and (8) holding. The above equations state that there is no flow dependency, but there is an anti-dependency or an output dependency.
In the specific implementation, the mapping in S3 is a two-level hybrid mapping. So as not to waste any computing resource, the invention targets two-level parallelized code, CPU multithreading plus GPU CUDA, mapping each parallelizable loop structure to C multithreaded code and CUDA code. FIG. 9 provides a simplified template of the mapping; the code structure after mapping follows this template. Blocks marked "parallel" are the parallelizable code blocks identified by parsing and analysis. The keyword "kthread" denotes the number of threads to be created on the CPU; "ctasks" and "gtasks" denote the tasks to be processed on the CPU and GPU, respectively. The loop function contains the parallelizable loop structure (top of FIG. 9); the CPU creates the corresponding number of threads, one of which is responsible for GPU scheduling while the other threads execute the parallel tasks. The CUDA kernel (bottom of FIG. 9), scheduled by the CPU thread, contains the same parallelizable loop to execute. The mapping is implemented by first mapping the serial code of the loop structure to CUDA parallel code, creating a CUDA kernel function that runs on the GPU, and then creating a thread on the CPU to schedule the GPU and execute the serial code.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. An automatic fine-grained two-stage parallel translation method is characterized by comprising the following steps:
S1: parsing the serial source code with a preset syntax analyzer to generate an abstract syntax tree corresponding to the source code, then traversing the abstract syntax tree and extracting the loop structures;
S2: analyzing the data dependency relationships in the extracted loop structure and determining whether a data dependency relationship exists, the data dependency relationships comprising a flow dependency relationship, an anti-dependency relationship and an output dependency relationship; if no data dependency relationship exists, directly parallelizing the loop structure; if a data dependency relationship exists, processing according to the type of the data dependency, specifically comprising: when the data dependency is a flow dependency relationship, marking the loop structure as non-parallelizable; when the data dependency is an anti-dependency or output dependency relationship, processing the loop structure by an array privatization technique, wherein processing the loop structure by the array privatization technique comprises: localizing the storage units corresponding to the loop iterations in the loop structure, thereby eliminating the anti-dependency and output dependency relationships caused by variable reuse;
S3: mapping the parallelizable loop structure to CUDA and CPU multithreaded execution structures to generate corresponding CUDA code and CPU multithreaded code, wherein the CPU creates a corresponding number of threads, one thread being responsible for GPU scheduling and the other threads executing the parallel tasks assigned to the CPU, while the GPU executes the tasks assigned to it in the GPU scheduling.
2. The two-stage parallel translation method according to claim 1, wherein step S1 includes:
s1.1: creating an EBNF description of the serial source code using an ANTLR tool;
S1.2: performing lexical analysis: matching characters in the EBNF description of the serial source code, masking or filtering irrelevant content, and generating tokens for syntax analysis;
S1.3: performing syntax analysis: analyzing the generated tokens and generating an abstract syntax tree corresponding to the source code;
S1.4: traversing the generated abstract syntax tree and extracting the loop structures, wherein a loop structure comprises the loop nesting level and loop-related variable information, and the loop-related variable information comprises variable names, variable types and the line numbers recording the loop positions.
3. The two-stage parallel translation method according to claim 1, wherein in step S1.3, when performing syntax analysis, rule parameters are added to enable the transfer of context information.
4. The two-stage parallel translation method according to claim 1, wherein analyzing the data dependency relationships in the extracted loop structure to determine whether a data dependency relationship exists comprises:
if a storage unit in the current loop structure is written in one iteration and then read in a subsequent iteration, a flow dependency relationship exists in the loop structure;
if a storage unit in the current loop structure is read in one iteration and then written in a subsequent iteration, an anti-dependency relationship exists in the loop structure;
if a storage unit in the current loop structure is written in one iteration and then written again in a subsequent iteration, an output dependency relationship exists in the loop structure.
CN202111464906.5A 2021-12-03 2021-12-03 Automatic fine-grained two-stage parallel translation method Pending CN114398039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111464906.5A CN114398039A (en) 2021-12-03 2021-12-03 Automatic fine-grained two-stage parallel translation method


Publications (1)

Publication Number Publication Date
CN114398039A true CN114398039A (en) 2022-04-26

Family

ID=81225279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111464906.5A Pending CN114398039A (en) 2021-12-03 2021-12-03 Automatic fine-grained two-stage parallel translation method

Country Status (1)

Country Link
CN (1) CN114398039A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117311728A (en) * 2023-09-27 2023-12-29 北京计算机技术及应用研究所 OpenCL automatic translation method
CN117075909A (en) * 2023-10-11 2023-11-17 沐曦集成电路(南京)有限公司 Compiling method, electronic device and medium for realizing parallel programming
CN117075909B (en) * 2023-10-11 2023-12-15 沐曦集成电路(南京)有限公司 Compiling method, electronic device and medium for realizing parallel programming


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination