CN114398039A - Automatic fine-grained two-stage parallel translation method - Google Patents

Automatic fine-grained two-stage parallel translation method

Info

Publication number
CN114398039A
CN114398039A (application CN202111464906.5A)
Authority
CN
China
Prior art keywords
loop
dependency relationship
loop structure
iteration
dependency
Prior art date
Legal status (assumed, not a legal conclusion)
Pending
Application number
CN202111464906.5A
Other languages
Chinese (zh)
Inventor
刘金硕
黄朔
邓娟
刘宁
王晨阳
唐浩洲
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111464906.5A priority Critical patent/CN114398039A/en
Publication of CN114398039A publication Critical patent/CN114398039A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/43 Checking; Contextual analysis
    • G06F 8/433 Dependency analysis; Data or control flow analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention provides an automatic fine-grained two-stage parallel translation method. The source C code is first parsed with ANTLR, which automatically generates an EBNF grammar description and the corresponding lexer and parser. The loop information extracted by the parser is then analyzed: if flow dependencies are found, the loop statements containing them are not parallelizable; if anti-dependencies or output dependencies between data are found, those dependencies are eliminated. A loop statement with no remaining data dependencies is parallelizable. The parallelizable loop structures are then mapped to structures suitable for CUDA and CPU multithreaded execution, and the corresponding CUDA code and CPU multithreaded code are generated.

Description

Automatic fine-grained two-stage parallel translation method
Technical Field
The invention relates to the technical field of computers, in particular to an automatic fine-grained two-stage parallel translation method.
Background
Since NVIDIA released the GeForce 256 graphics processing chip in 1999 and introduced the GPU concept, the GPU has become one of the main accelerator options in current high-performance computing systems; owing to its powerful computing capability, flexible programmability and low power consumption, it is widely used for compute-intensive programs. For the same task, running a parallel program on a GPU can greatly reduce the running time compared with executing a serial program on a CPU, and the advantage of GPU parallel computing is especially pronounced when processing large data.
Current parallel programming methods include MPI, OpenCL and OpenMP, but manually or semi-automatically converting the large body of existing serial programs into parallel programs remains a significant challenge. Some automatic translation tools are also available: for example, PPCG is a source-to-source compiler that uses a multi-level tiling strategy with polyhedral parallel code generation for C-to-CUDA conversion, and Bones is a skeleton-based source-to-source automatic parallelization method that converts C into five types of target code.
Existing translation tools or methods can only convert a serial program into a single parallel form, such as CPU multithreading or GPU code. However, even though the GPU has excellent acceleration capability, the CPU must sit idle waiting for the GPU to complete its computing tasks, wasting the computational resources of the CPU.
Disclosure of Invention
The invention provides an automatic fine-grained two-stage parallel translation method, which is used for solving or at least partially solving the technical problem of low calculation efficiency in the prior art.
In order to solve the technical problem, the invention provides an automatic fine-grained two-stage parallel translation method, which comprises the following steps:
s1: analyzing the serial source code with a preset parser to generate an abstract syntax tree corresponding to the source code, then traversing the abstract syntax tree and extracting the loop structures;
s2: analyzing the data dependencies in the extracted loop structure and judging whether a data dependency exists, wherein the data dependencies comprise flow dependencies, anti-dependencies and output dependencies; if no data dependency exists, directly parallelizing the loop structure; if a data dependency exists, processing according to its type, specifically comprising: when the data dependency is a flow dependency, marking the loop structure as not parallelizable; when the data dependency is an anti-dependency or an output dependency, processing the loop structure with the array privatization technique, wherein processing the loop structure with the array privatization technique comprises: localizing the storage units corresponding to the loop iterations in the loop structure, thereby eliminating the anti-dependencies and output dependencies caused by variable reuse;
s3: mapping the parallelizable loop structure to a CUDA and CPU multithreaded execution structure to generate the corresponding CUDA code and CPU multithreaded code, wherein the CPU creates a corresponding number of threads: one thread is responsible for GPU scheduling and the other threads execute the parallel tasks assigned to the CPU, while the GPU executes the tasks assigned to it under that scheduling.
In one embodiment, step S1 includes:
s1.1: creating an EBNF description of the serial source code using an ANTLR tool;
s1.2: performing lexical analysis: matching the characters of the serial source code against the EBNF description, masking or filtering irrelevant content, and generating tokens for syntax analysis;
s1.3: performing syntax analysis: analyzing the generated tokens and generating the abstract syntax tree corresponding to the source code;
s1.4: traversing the generated abstract syntax tree and extracting the loop structures, wherein a loop structure includes the loop nesting level and the loop-related variable information, and the loop-related variable information includes the variable name, the variable type and the line number recording the loop position.
In one embodiment, step S1.3 adds rule parameters to enable the transfer of context information when performing the parsing.
In one embodiment, analyzing the data dependencies in the extracted loop structure to judge whether a data dependency exists comprises:
if a storage unit in the current loop structure is written in one iteration and then read in a subsequent iteration, a flow dependency exists in the loop structure;
if a storage unit in the current loop structure is read in one iteration and then written in a subsequent iteration, an anti-dependency exists in the loop structure;
if a storage unit in the current loop structure is written in one iteration and then written again in a subsequent iteration, an output dependency exists in the loop structure.
The technical solutions in the embodiments of the present application have at least the following technical effects:
The invention provides an automatic fine-grained two-stage parallel translation method: a preset parser first analyzes the serial source code; the data dependencies in the extracted loop structure are then analyzed to judge whether a data dependency exists, and when the dependency is an anti-dependency or an output dependency it is eliminated with the array privatization technique; finally, the parallelizable loop structure is mapped to a CUDA and CPU multithreaded execution structure. In this way the source C code is automatically translated into code for both the multithreaded CPU and the GPU: the translation result is divided into C code on the CPU and CUDA code on the GPU, which execute in parallel. The CPU acts as the host, responsible for serial computation such as control logic and transaction processing; the GPU acts as the coprocessor or device end, responsible for large-scale data-parallel computation with high computational density and simple logic branches, so that computational efficiency is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of an automatic fine-grained two-stage parallel translation method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a code fragment using rule parameters according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the declaration of a loop statement using a return value in an embodiment of the present invention.
FIG. 4 is a diagram illustrating an embodiment of the present invention being assigned a new value in a current iteration and then used in a next iteration.
FIG. 5 is a diagram illustrating an embodiment of the present invention in which an array is assigned outside a loop and reused in an iteration.
FIG. 6 is a schematic diagram of the privatization of the array in FIG. 5 according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the first allocation of an array and the subsequent reuse in the same iteration according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the privatization of the array shown in FIG. 7 according to an embodiment of the present invention.
FIG. 9 is a simplified mapping template in accordance with an embodiment of the present invention.
FIG. 10 is a flow chart of the parser operation in an embodiment of the present invention.
Detailed Description
The invention relates to an automatic fine-grained two-stage parallel translation method for big-data task processing. From the EBNF description, ANTLR generates the corresponding lexer and parser for building the abstract syntax tree. The loop information extracted by the parser is analyzed: if a flow dependency is found, the loop statement containing it is marked as not parallelizable; if anti-dependencies or output dependencies between data are found, the loop structure is processed to eliminate them; a loop statement with no data dependencies is parallelizable. To eliminate the anti-dependencies and output dependencies caused by variable reuse, the storage units involved in each loop iteration are localized using the array privatization technique, so that interaction with the storage units of other loop iterations is cut off. The parallelizable loop structures (including loop structures without data dependencies and loop structures whose anti-dependencies and output dependencies have been eliminated) are mapped to structures suitable for CUDA and CPU multithreaded execution, and the corresponding CUDA code and CPU multithreaded code are then generated.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an automatic fine-grained two-stage parallel translation method, which comprises the following steps:
s1: analyzing the serial source code with a preset parser to generate an abstract syntax tree corresponding to the source code, then traversing the abstract syntax tree and extracting the loop structures;
s2: analyzing the data dependencies in the extracted loop structure and judging whether a data dependency exists, wherein the data dependencies comprise flow dependencies, anti-dependencies and output dependencies; if no data dependency exists, directly parallelizing the loop structure; if a data dependency exists, processing according to its type, specifically comprising: when the data dependency is a flow dependency, marking the loop structure as not parallelizable; when the data dependency is an anti-dependency or an output dependency, processing the loop structure with the array privatization technique, wherein processing the loop structure with the array privatization technique comprises: localizing the storage units corresponding to the loop iterations in the loop structure, thereby eliminating the anti-dependencies and output dependencies caused by variable reuse;
s3: mapping the parallelizable loop structure to a CUDA and CPU multithreaded execution structure to generate the corresponding CUDA code and CPU multithreaded code, wherein the CPU creates a corresponding number of threads: one thread is responsible for GPU scheduling and the other threads execute the parallel tasks assigned to the CPU, while the GPU executes the tasks assigned to it under that scheduling.
In a specific implementation process, the preset parser is ANTLR, an open-source parser generator that can automatically build a syntax tree from its input and display it visually. ANTLR (ANother Tool for Language Recognition), whose predecessor is PCCTS, provides a framework for automatically constructing recognizers, compilers and interpreters of custom languages from grammatical descriptions, for languages including Java, C++ and C#.
The automatic fine-grained two-stage parallel translation method provided by the invention comprises three parts: parsing the source C code with ANTLR, analyzing the data dependencies, and eliminating anti-dependencies and mapping; see FIG. 1.
Step S1 parses the source C code through ANTLR: the source code is scanned, and an Extended Backus-Naur Form (EBNF) grammar description can then be generated automatically. From the EBNF description, ANTLR generates the corresponding lexer and parser for building the Abstract Syntax Tree (AST).
Step S2 analyzes the loop structures (with their loop information) extracted by the parser. If flow dependencies are found, the loop statements containing them are not parallelizable. If anti-dependencies or output dependencies between data are found, the loop structure is processed to eliminate them; a loop statement with no data dependencies is parallelizable. When anti-dependencies and output dependencies between data exist, those caused by variable reuse are eliminated, and the storage units involved in each loop iteration are localized with the array privatization technique so that interaction with the storage units of other iterations is cut off.
Step S3 maps the parallelizable loop structures to structures suitable for CUDA and CPU multithread execution, and then generates corresponding CUDA code and CPU multithread code. The multicore CPU creates a corresponding number of threads: one thread is responsible for GPU scheduling and the other threads perform parallel tasks assigned to the CPU. At the same time, the GPU performs the tasks assigned to it.
In one embodiment, step S1 includes:
s1.1: creating an EBNF description of the serial source code using an ANTLR tool;
s1.2: performing lexical analysis: matching the characters of the serial source code against the EBNF description, masking or filtering irrelevant content, and generating tokens for syntax analysis;
s1.3: performing syntax analysis: analyzing the generated tokens and generating the abstract syntax tree corresponding to the source code;
s1.4: traversing the generated abstract syntax tree and extracting the loop structures, wherein a loop structure includes the loop nesting level and the loop-related variable information, and the loop-related variable information includes the variable name, the variable type and the line number recording the loop position.
Wherein, when the syntax analysis is executed in step S1.3, rule parameters are added to implement the transfer of context information.
In the specific implementation process, analyzing the source C code through ANTLR mainly includes the following contents:
ANTLR is a tool for automatically generating language recognizers. It produces a grammatical description of the source C code using EBNF rules, performs lexical and syntax analysis on the source program according to the grammar attributes, and then generates an AST (abstract syntax tree). ANTLR provides a mechanism for traversing the AST, which helps extract loop-related information. The invention uses ANTLR to parse the source C code, generate the AST and extract the loop-related information. The workflow of the parser is shown in FIG. 10.
First, step S1.1 is performed: an EBNF description of the serial source code is created using ANTLR. A grammar described in EBNF can be represented by a quadruple:
G(Z) = (Vn, Vt, S, P)    (1)
where Vn is the finite set of non-terminal symbols; Vt is the finite set of terminal symbols; S is the start symbol of the grammar; P is the finite set of productions (the rule set); and Z denotes the source code. The most important part of the grammar is P, the production set, whose rules take the form "A : a". The capital letter A on the left-hand side of a production denotes a non-terminal symbol; the lower-case a on the right-hand side may contain both non-terminal and terminal symbols.
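As a toy illustration of the quadruple (the grammar below is invented for this sketch and is not from the patent), a grammar whose language is the strings of one or more a's can be written as:

```latex
% Toy instance of G(Z) = (Vn, Vt, S, P): one non-terminal S, one
% terminal a, start symbol S, and two productions of the form "A : a".
G = (\{S\},\ \{a\},\ S,\ \{\, S \to a\,S,\ \ S \to a \,\})
```

Here "S : a S" and "S : a" are the two entries of P, matching the "A : a" production shape described above.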
Then step S1.2 is performed, where lexical analysis is performed, matching characters in the input stream, masking or filtering out irrelevant content, and generating tokens for syntactic analysis. To achieve this, ANTLR adds a series of filtering methods to the lexical grammar.
In source code, characters such as spaces, tabs, carriage returns, and line breaks are usually meaningless redundant characters. ANTLR provides a skip () method to skip these meaningless symbols. For example, use
WS : (' ' | '\t' | '\n' | '\r')+ { skip(); };
When these characters are encountered, the skip() method is called to skip them. Comments in the source code are meaningless at compile time but may need to be retained when generating the final document. ANTLR provides a channel mechanism to hide comments at compile time. For example, using
COMMENT : '/*' .* '*/' { $channel = HIDDEN; };
Matching comment blocks can be placed in the HIDDEN channel without appearing in subsequent parsing.
Then step S1.3 is performed: syntax analysis is carried out on the tokens from the previous step. By default, ANTLR works with context-free rules. Preferably, the invention adds rule parameters to pass context information, making up for the deficiency of the context-free grammar. For example, in FIG. 2 the code snippet determines whether the type of a variable assignment meets requirements; in the variable declaration syntax, a rule parameter idList[type is added. To extract loop-related variable information, the invention adds a return value to the syntax expression of each loop statement so that variable-related information can be extracted directly from the various loop statements. FIG. 3 shows an example of adding a return value int to a while statement declaration.
The AST is formed after parsing. It stores the data structure of the program in the form of a tree, with each node representing a construct in the source code. Traversing the abstract syntax tree of the serial C code serves mainly to visit the loop structures. In step S1.4, the AST is traversed using the Visitor mechanism provided by ANTLR. The invention overrides the visitForStatement() method, which stores the loop nesting level and the loop-related variable information of the source C code, including the variable name, the variable type and the line number recording the loop position. The collected variable information is then handed to the next stage for processing.
In one embodiment, analyzing the extracted data dependency relationship in the loop structure to determine whether a data dependency relationship exists includes:
if one storage unit in the current loop structure is written in one iteration and then read in the subsequent iteration, the stream dependency relationship exists in the loop structure;
if a storage unit is read in one iteration in the current loop structure and then written in a storage unit in the subsequent iteration, the fact that the anti-dependency relationship exists in the loop structure is shown;
if a storage unit in the current loop structure is written in one iteration and then written again in the subsequent iteration, the output dependency relationship in the loop structure is shown to exist.
In particular, dependencies refer to partial order relationships of statements in a program that reflect the inherent order required to maintain the semantics of the program. The parallelism of the program is influenced by reading and writing access to data, so the dependence to be considered in parallel conversion is data dependence.
According to the read and write operations on the same memory area, data dependencies fall into flow dependencies, anti-dependencies and output dependencies. In a loop structure, a flow dependency means that a storage unit is written in one iteration and then read in a subsequent iteration; an anti-dependency means that a storage unit is read in one iteration and then written in a subsequent iteration; an output dependency means that a storage unit is written in one iteration and then written again in a subsequent iteration.
In the specific implementation, assume that in a loop statement F (a loop structure), I is the iteration space and i (i ∈ I) is the loop control variable of one iteration in I. Under iteration i, Read_i denotes the set of all variables read and Write_i denotes the set of all variables written. A sufficient condition for F to be parallelizable is then:
(Read_i ∩ Write_j) ∪ (Write_i ∩ Read_j) ∪ (Write_i ∩ Write_j) = ∅    (2)
where i, j ∈ I and i ≠ j. Equation (2) states that the loop structure has no flow dependency, no anti-dependency, and no output dependency.
If a flow dependency exists in F, the following condition is satisfied:
Write_j ∩ Read_i ≠ ∅    (3)
where i, j ∈ I and i > j. A write to a memory region precedes a read of the same region, similar to the relationship between producer and consumer. Loop structures containing flow dependencies cannot be executed in parallel on a GPU.
If an anti-dependency exists in F, the following condition is satisfied:
Read_k ∩ Write_l ≠ ∅    (4)
where k, l ∈ I and k < l. A read of a memory region occurs before a write to it, owing to repeated references to the same region; automatic parallel translation can be achieved by creating a temporary storage area.
If an output dependency exists in F (the loop structure F contains output dependencies), the following condition is satisfied:
Write_m ∩ Write_n ≠ ∅    (5)
where m, n ∈ I and m ≠ n. The same storage area is written at least twice. For both anti-dependencies and output dependencies, automatic parallel conversion can be achieved by creating a temporary storage area.
Specifically, the loop-related information extracted by the parser is taken as input and subjected to data dependency analysis. If anti-dependencies or output dependencies between data are found, the loop structures containing them are passed to the next stage, loop array privatization. If a flow dependency is found, the loop structure is marked as not parallelizable. If no data dependency is found, the loop structure can be parallelized directly.
In serial C code, reuse of the same variables is a major obstacle to automatic parallel conversion: reuse of memory addresses produces flow dependencies, anti-dependencies and output dependencies. The invention designs a loop array privatization stage (the array privatization technique) to eliminate the removable dependencies. The explicit representations of storage units in loop statements are variables and arrays; a variable can be regarded as a special case of an array with a single element. In serial C code, a global array is typically used to store data to reduce memory usage, and this global array is then used in every iteration of the loop statement. Loop array privatization gives each iteration a new, private storage space in place of the original reused space, so that no cross-iteration dependencies remain.
Apart from array initialization, the positions where an array is reassigned in a loop statement fall into three categories. In the first, the array is assigned a new value in the current iteration and then used in the next iteration. In the second, the array is assigned outside the loop and reused in the iterations. In the third, the array is assigned first and then reused within the same iteration.
FIG. 4 shows an example of the first category. In this loop statement, when i == 0, the value "temp2 + 2" is assigned to array A in the fifth line; when i == 1, array A is used in the fourth line in the statement "temp1 = A + 1". This is a loop with a flow dependency, so it cannot be translated by loop array privatization.
FIG. 5 shows an example of the second category. In this code segment, array A is assigned the value "1" outside the loop and used in the fifth line of the loop. In this case array A may be privatized, as shown in FIG. 6.
FIG. 7 shows an example of the third category. In this loop statement, when i == 0, array A is first assigned "temp2 + 2" in the fourth line and then used in the statement "temp1 = A + 1" in the fifth line; the array is assigned and reused within the same iteration. In this case array A may be privatized, as shown in FIG. 8.
The first condition for privatizing a loop statement is that there are no flow dependencies between iterations. The second condition is that the array is reassigned outside the loop, or before its use within the same iteration. Combining the conditions for anti-dependency and output dependency, the necessary conditions for array privatization are:
Write_j ∩ Read_i = ∅    (6)
Read_k ∩ Write_l ≠ ∅    (7)
Write_m ∩ Write_n ≠ ∅    (8)
where i, j, k, l, m, n ∈ I and i > j, l > k, m ≠ n, with at least one of (7) and (8) holding. The above equations state that there is no flow dependency, but there is an anti-dependency or an output dependency.
In the specific implementation, the mapping in S3 is a two-level hybrid mapping. So as not to waste any computing resource, the invention targets two-level parallelized code, CPU multithreading plus GPU CUDA, mapping each parallelizable loop structure to C multithreaded code and CUDA code. FIG. 9 provides a simplified template of the mapping; the code structure after mapping follows this template. Blocks marked "parallel" are the parallelizable code blocks identified by parsing and analysis. The keyword "kthread" denotes the number of threads to be created on the CPU; "ctasks" and "gtasks" denote the tasks to be processed on the CPU and GPU, respectively. The loop function contains the parallelizable loop structure (top of FIG. 9); the CPU creates the corresponding number of threads, one of which is responsible for GPU scheduling while the other threads execute the parallel tasks. The CUDA kernel (bottom of FIG. 9), scheduled by the CPU thread, contains the same parallelizable loop to execute. The mapping is implemented by first mapping the serial code of the loop structure to CUDA parallel code, creating a CUDA kernel function that runs on the GPU, and then creating a thread on the CPU to schedule the GPU and execute the serial code.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. An automatic fine-grained two-stage parallel translation method is characterized by comprising the following steps:
S1: parsing the serial source code with a preset syntax analyzer to generate an abstract syntax tree corresponding to the source code, then traversing the abstract syntax tree and extracting the loop structures;
S2: analyzing the data dependency relationships in the extracted loop structure and determining whether a data dependency relationship exists, the data dependency relationships comprising a flow dependency relationship, an anti-dependency relationship and an output dependency relationship; if no data dependency relationship exists, directly parallelizing the loop structure; if a data dependency relationship exists, processing according to the type of the data dependency, specifically comprising: when the data dependency is a flow dependency relationship, marking the loop structure as non-parallelizable; when the data dependency is an anti-dependency or output dependency relationship, processing the loop structure by an array privatization technique, wherein processing the loop structure by the array privatization technique comprises: localizing the storage units corresponding to the loop iterations in the loop structure, thereby eliminating the anti-dependency and output dependency relationships caused by variable reuse;
S3: mapping the parallelizable loop structure to CUDA and CPU multithreaded execution structures to generate corresponding CUDA code and CPU multithreaded code, wherein the CPU creates a corresponding number of threads, one thread being responsible for GPU scheduling and the other threads executing the parallel tasks assigned to the CPU, while the GPU executes the tasks assigned to it in the GPU scheduling.
2. The two-stage parallel translation method according to claim 1, wherein step S1 includes:
s1.1: creating an EBNF description of the serial source code using an ANTLR tool;
S1.2: performing lexical analysis: matching characters in the EBNF description of the serial source code, masking or filtering irrelevant content, and generating tokens for syntax analysis;
S1.3: performing syntax analysis: analyzing the generated tokens and generating an abstract syntax tree corresponding to the source code;
S1.4: traversing the generated abstract syntax tree and extracting the loop structures, wherein a loop structure comprises the loop nesting level and loop-related variable information, and the loop-related variable information comprises variable names, variable types and the line numbers recording the loop positions.
3. The two-stage parallel translation method according to claim 1, wherein in step S1.3, when performing syntax analysis, rule parameters are added to enable the transfer of context information.
4. The two-stage parallel translation method according to claim 1, wherein analyzing the data dependency relationships in the extracted loop structure to determine whether a data dependency relationship exists comprises:
if a storage unit in the current loop structure is written in one iteration and then read in a subsequent iteration, a flow dependency relationship exists in the loop structure;
if a storage unit in the current loop structure is read in one iteration and then written in a subsequent iteration, an anti-dependency relationship exists in the loop structure;
if a storage unit in the current loop structure is written in one iteration and then written again in a subsequent iteration, an output dependency relationship exists in the loop structure.
CN202111464906.5A 2021-12-03 2021-12-03 Automatic fine-grained two-stage parallel translation method Pending CN114398039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111464906.5A CN114398039A (en) 2021-12-03 2021-12-03 Automatic fine-grained two-stage parallel translation method


Publications (1)

Publication Number Publication Date
CN114398039A true CN114398039A (en) 2022-04-26

Family

ID=81225279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111464906.5A Pending CN114398039A (en) 2021-12-03 2021-12-03 Automatic fine-grained two-stage parallel translation method

Country Status (1)

Country Link
CN (1) CN114398039A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117311728A (en) * 2023-09-27 2023-12-29 北京计算机技术及应用研究所 OpenCL automatic translation method
CN117075909A (en) * 2023-10-11 2023-11-17 沐曦集成电路(南京)有限公司 Compiling method, electronic device and medium for realizing parallel programming
CN117075909B (en) * 2023-10-11 2023-12-15 沐曦集成电路(南京)有限公司 Compiling method, electronic device and medium for realizing parallel programming


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination