CN109522126B - Thread-level parallel data optimization method and device in shared memory multi-core structure - Google Patents

Thread-level parallel data optimization method and device in shared memory multi-core structure

Info

Publication number
CN109522126B
CN109522126B (application CN201811376636.0A)
Authority
CN
China
Prior art keywords
privatization
array
data
scalar
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811376636.0A
Other languages
Chinese (zh)
Other versions
CN109522126A (en)
Inventor
徐金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201811376636.0A
Publication of CN109522126A
Application granted
Publication of CN109522126B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 — Indexing scheme relating to G06F 9/00
    • G06F 2209/50 — Indexing scheme relating to G06F 9/50
    • G06F 2209/5018 — Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention belongs to the technical field of parallel optimization for computer multi-core structures, and specifically relates to a data optimization method and device for thread-level parallelism in a shared memory multi-core structure. The method comprises the following steps: detecting and judging whether all loop-carried dependences of the data to be optimized satisfy the conditions under which privatization is profitable; and performing data privatization according to the detection result. For parallel loop threads on multi-core hardware, the invention first tests whether privatization of the data is profitable and then eliminates the loop-carried dependences through data privatization, thereby achieving multithreaded parallel execution of the program. Data privatization is further divided into scalar privatization and array privatization. The method fully exploits the advantages of the shared memory multi-core hardware structure, improves thread execution efficiency and the utilization of processor and overall system performance, reduces running overhead, and provides important guidance for thread-level parallel processing on shared memory multi-core structures.

Description

Thread-level parallel data optimization method and device in shared memory multi-core structure
Technical Field
The invention belongs to the technical field of parallel optimization for computer multi-core structures, and particularly relates to a data optimization method and device for thread-level parallelism in a shared memory multi-core structure.
Background
Currently, parallel architectures are no longer the privilege of supercomputers; modern processors commonly adopt multi-core designs to achieve higher performance. To fully exploit the hardware of such multi-core structures, efficient thread-level parallelism must be developed, and the OpenMP programming model is an effective means of achieving it. As shown in fig. 2, in the OpenMP execution model only a main thread runs at the beginning of execution. When the main thread encounters a region requiring parallel computation, several threads are forked (Fork: new threads are created or existing threads are woken up) to execute the parallel tasks. During parallel execution the main thread and the forked threads work together; after the parallel region finishes, the forked threads exit or suspend, and control flow returns to the single main thread (Join: the convergence of multiple threads). OpenMP conveniently achieves multithreaded parallelism of loops, i.e., assigning different loop iterations to different threads for execution. Because the execution order of the threads is arbitrary, the original iteration order is effectively shuffled, so no dependence may exist between iterations; such a dependence between iterations is called a loop-carried dependence in the following. Any loop-carried dependence prevents parallelization of the program, degrading thread execution efficiency and the running performance of the multi-core hardware while increasing overhead.
Disclosure of Invention
Therefore, the invention provides a data optimization method and device for thread-level parallelism in a shared memory multi-core structure, which fully exploit the hardware performance of the multi-core structure, satisfy the requirements of multithreaded loop parallelism, make thread-level parallel execution of loop data effective, and improve the data execution efficiency of a multi-core processor.
According to the design scheme provided by the invention, the data optimization method for thread-level parallelism in a shared memory multi-core structure comprises the following steps: for the loop to be optimized, detecting and judging whether all loop-carried dependences satisfy the conditions for profitable privatization; and performing data privatization according to the detection result.
In the above, the profitable-privatization condition refers to whether all loop-carried dependences can be eliminated by data privatization.
Detecting and judging whether privatization is profitable for all loop-carried dependences includes the following:
For the target loop level of the data to be optimized: for each scalar, if every reference in the target loop level is assigned by the current iteration, the dependence caused by the scalar is judged to require privatization; otherwise the scalar needs no data privatization. For each array, if the array dependence is carried by a loop level inside the target loop level, and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, the dependence caused by the array is judged to require privatization; otherwise the array needs no data privatization.
Preferably, privatization of scalar-induced dependences according to the detection result includes the following steps:
For a dependence caused by a scalar to be privatized, first traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W. If W is empty, no scalar privatization is needed; otherwise traverse W and detect whether each scalar involved is a reduction variable, and if so, transform the reduction variable into an induction variable. Finally, traverse W and privatize each scalar involved.
Further, each scalar is privatized either by scalar expansion into an array or by annotating it with a privatization clause.
Preferably, privatization of array-induced dependences according to the detection result includes the following:
For a dependence caused by an array to be privatized, first traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW. If AW is empty, no array privatization is needed; otherwise traverse AW and detect whether a reduction exists in each array reference involved, and if so, transform the reduction into an induction. Finally, traverse AW and privatize each array involved.
Further, each array is privatized either by array dimension expansion or by annotating it with a privatization clause.
A data optimization device for thread-level parallelism in a shared memory multi-core structure comprises a test module and an optimization module, wherein
the test module is used to detect and judge whether all loop-carried dependences of the data to be optimized satisfy the conditions for profitable privatization, and to feed back the cases requiring privatization to the optimization module according to the detection result;
and the optimization module is used to perform data privatization according to the feedback from the test module.
In the above device, the test module comprises a first detection submodule and a second detection submodule, wherein
the first detection submodule is used, for the target loop level of the data to be optimized, to judge for each scalar that the dependence caused by the scalar requires privatization if every reference in the target loop level is assigned by the current iteration, and otherwise that the scalar needs no data privatization;
and the second detection submodule is used, for the target loop level of the data to be optimized, to judge for each array that the dependence caused by the array requires privatization if the array dependence is carried by a loop level inside the target loop level and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, and otherwise that the array needs no data privatization.
In the above apparatus, the optimization module comprises a first optimization submodule and a second optimization submodule, wherein
the first optimization submodule is used, according to the feedback result and for each dependence caused by a scalar to be privatized, to traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W; if W is empty, no scalar privatization is needed, otherwise W is traversed to detect whether each scalar involved is a reduction variable and, if so, to transform the reduction variable into an induction variable; finally W is traversed and each scalar involved is privatized;
and the second optimization submodule is used, according to the feedback result and for each dependence caused by an array to be privatized, to traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW; if AW is empty, no array privatization is needed, otherwise AW is traversed to detect whether a reduction exists in each array reference involved and, if so, to transform the reduction into an induction; finally AW is traversed and each array involved is privatized.
The invention has the beneficial effects that:
aiming at the circulation parallel threads in the multi-core hardware structure, the invention firstly carries out the favorable detection of the privatization processing on the data, and then eliminates the circulation carrying dependence through the data privatization, thereby realizing the multi-thread parallel of the program; and the data privatization is further divided into scalar privatization and group privatization, the multi-core hardware structure advantage of shared storage is fully utilized, the thread execution efficiency is improved, the utilization rate of the performance of a processor and even a computer system is improved, the operation cost is reduced, and the method has important guiding significance on the shared memory multi-core structure thread-level parallel processing technology.
Description of the drawings:
FIG. 1 is a schematic flow chart of a data optimization method in an embodiment;
FIG. 2 is a diagram illustrating an OpenMP parallel execution model according to an embodiment;
FIG. 3 is a schematic diagram of a data optimization device in an embodiment;
FIG. 4 is a schematic diagram of a test module in an embodiment;
FIG. 5 is a schematic diagram of an optimization module in an embodiment.
Detailed description of embodiments:
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions.
As shown in fig. 2, in the OpenMP execution model only a main thread runs at the beginning of execution. When the main thread encounters a region requiring parallel computation, several threads are forked (Fork: new threads are created or existing threads are woken up) to execute the parallel tasks. During parallel execution the main thread and the forked threads work together; after the parallel region finishes, the forked threads exit or suspend, and control flow returns to the single main thread (Join: the convergence of multiple threads). Multithreaded loop parallelism can be conveniently implemented with OpenMP. In multithreaded loop parallelism the execution order of the threads is arbitrary, so no loop-carried dependence may exist between iterations; otherwise parallelization of the program is prevented, degrading the execution efficiency and running performance of the multi-core processor. In view of this, an embodiment of the present invention, as shown in fig. 1, provides a data optimization method for thread-level parallelism in a shared memory multi-core structure, including the following steps:
S101, for the data to be optimized, detecting and judging whether all loop-carried dependences satisfy the conditions for profitable privatization;
S102, performing data privatization according to the detection result.
In the embodiment of the invention, a privatization profitability test is first performed to judge whether all dependences hindering parallelism can be eliminated by privatization; then, according to the test result, the dependences are eliminated through data privatization, breaking the dependence relations between iterations, achieving thread-level parallelism of the loop, fully utilizing the hardware performance of the multi-core structure, and improving program execution efficiency.
When detecting and judging whether privatization is profitable for all loop-carried dependences, in another embodiment of the present invention, the detection process is designed to include the following:
For the target loop level of the data to be optimized: for each scalar, if every reference in the target loop level is assigned by the current iteration, the dependence caused by the scalar is judged to require privatization; otherwise the scalar needs no data privatization. For each array, if the array dependence is carried by a loop level inside the target loop level, and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, the dependence caused by the array is judged to require privatization; otherwise the array needs no data privatization.
Since data privatization consumes memory space, it does not always pay off. Therefore, before privatizing, it is detected whether all loop-carried dependences can be eliminated by data privatization; if so, the actual privatization proceeds, otherwise privatization is abandoned, avoiding unnecessary program execution overhead. For carried dependences caused by scalars, the loop level to be parallelized is called the target loop level; for each scalar, if every reference within an iteration of the target loop level reads a value computed by that same iteration, the dependence caused by the scalar can be eliminated by privatization, and the next profitability test continues. For each array dependence, the loop level carrying the dependence must be examined; again calling the loop level to be parallelized the target loop level, three cases arise: (1) the dependence is carried by a loop level outside the target loop level; (2) the dependence is carried by the target loop level itself; (3) the dependence is carried by a loop level inside the target loop level. In the first case the dependence is carried by the outer loop, the target loop level itself carries no dependence, and the array need not be privatized. In the second case the dependence cannot be eliminated by array privatization, and the profitability test fails. In the third case, if the value of every array element used in an iteration of the target loop level is computed by the current iteration, the dependence can be eliminated by privatization; otherwise it cannot, and the profitability test fails.
Privatization of scalar-induced dependences is performed according to the detection result; in another embodiment of the invention, the scalar privatization includes the following:
For a dependence caused by a scalar to be privatized, first traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W. If W is empty, no scalar privatization is needed; otherwise traverse W and detect whether each scalar involved is a reduction variable, and if so, transform the reduction variable into an induction variable. Finally, traverse W and privatize each scalar involved.
Furthermore, the scalar is privatized either by scalar expansion into an array or by annotating it with a privatization clause such as private.
Privatization of array-induced dependences is performed according to the detection result; in another embodiment of the present invention, it includes the following:
For a dependence caused by an array to be privatized, first traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW. If AW is empty, no array privatization is needed; otherwise traverse AW and detect whether a reduction exists in each array reference involved, and if so, transform the reduction into an induction. Finally, traverse AW and privatize each array involved.
Furthermore, the array is privatized either by array dimension expansion or by annotating it with a privatization clause such as private.
To detect whether an array reference is a reduction, one judges whether it has the form "ARRAY[i] = ARRAY[i] op XX"; if such an array reduction exists, it can be transformed into an induction. For each privatized array that is also read in the loop body, every related array element is initialized; the default initial value of a privatized array is 0, which avoids inconsistencies with the original values that would otherwise affect program execution.
Based on the above data optimization method, an embodiment of the present invention further provides a data optimization apparatus for thread-level parallelism in a shared memory multi-core structure, as shown in fig. 3, including a testing module 101 and an optimizing module 102, wherein,
the test module 101 is used to detect and judge whether all loop-carried dependences of the data to be optimized satisfy the conditions for profitable privatization, and to feed back the cases requiring privatization to the optimization module according to the detection result;
and the optimization module 102 is used to perform data privatization according to the feedback from the test module.
In the above-described apparatus, referring to fig. 4, the test module 101 comprises a first detection submodule 1001 and a second detection submodule 1002, wherein
the first detection submodule 1001 is used, for the target loop level of the data to be optimized, to judge for each scalar that the dependence caused by the scalar requires privatization if every reference in the target loop level is assigned by the current iteration, and otherwise that the scalar needs no data privatization;
and the second detection submodule 1002 is used, for the target loop level of the data to be optimized, to judge for each array that the dependence caused by the array requires privatization if the array dependence is carried by a loop level inside the target loop level and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, and otherwise that the array needs no data privatization.
In the above-described apparatus, referring to fig. 5, the optimization module 102 comprises a first optimization submodule 2001 and a second optimization submodule 2002, wherein
the first optimization submodule 2001 is used, according to the feedback result and for each dependence caused by a scalar to be privatized, to traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W; if W is empty, no scalar privatization is needed, otherwise W is traversed to detect whether each scalar involved is a reduction variable and, if so, to transform the reduction variable into an induction variable; finally W is traversed and each scalar involved is privatized;
and the second optimization submodule 2002 is used, according to the feedback result and for each dependence caused by an array to be privatized, to traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW; if AW is empty, no array privatization is needed, otherwise AW is traversed to detect whether a reduction exists in each array reference involved and, if so, to transform the reduction into an induction; finally AW is traversed and each array involved is privatized.
Based on the above, the embodiments of the present invention are further described by means of pseudocode and specific examples; the pseudocode designs and examples for scalar and array privatization are as follows:
Scalar privatization pseudocode and a scalar privatization instance are given as images (Figure BDA0001870909460000081 and Figure BDA0001870909460000082) in the original publication and are not reproduced in this text.
Array privatization pseudocode and an array privatization instance are given as images (Figure BDA0001870909460000091 and Figure BDA0001870909460000101) in the original publication and are not reproduced in this text.
based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above method, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.
Unless specifically stated otherwise, the relative arrangement of steps, the numerical expressions, and the values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The device provided by the embodiments of the present invention has the same implementation principle and technical effects as the method embodiments; for brevity, where the device embodiments omit details, reference may be made to the corresponding content in the method embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A data optimization method for thread-level parallelism in a shared-memory multi-core structure, characterized by comprising the following steps:
for the loop to be optimized, detecting and judging whether a privatization-favorable condition exists for all loop-carried dependences, and performing data privatization processing according to the detection and judgment result;
wherein detecting and judging the privatization-favorable condition for all loop-carried dependences comprises the following:
for the target loop layer of the data to be optimized, if every reference to a scalar in the target loop layer is assigned by the current iteration, determining that the dependence caused by the scalar requires privatization processing, and otherwise determining that the scalar does not need data privatization processing; for each array, if the array dependence is carried by a loop layer within the target loop layer, and every reference to each array element within an iteration of the target loop layer is detected to be assigned by the current iteration, determining that the dependence caused by the array requires privatization processing, and otherwise determining that the array does not need data privatization.
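By way of illustration only (this sketch is not part of the claims, and the data model is hypothetical), the detection rule above can be expressed as a small check over a loop body described as an ordered list of per-statement write/read sets: a scalar is privatizable only if every read of it is preceded, within the same iteration, by a write.

```python
# Toy sketch of the scalar detection rule (illustrative names, not from the
# patent): a scalar may be privatized if it is assigned by the current
# iteration before any use, so no value flows between iterations.

def scalar_is_privatizable(body, scalar):
    """body: ordered list of (writes, reads) set pairs, one per statement."""
    written = False
    for writes, reads in body:
        if scalar in reads and not written:
            return False          # reads a value from a previous iteration
        if scalar in writes:
            written = True        # assigned by the current iteration
    return written                # must actually be written in the loop

# t = a[i] * 2; b[i] = t + 1  ->  t is written before it is read
body = [({"t"}, {"a"}), ({"b"}, {"t"})]
print(scalar_is_privatizable(body, "t"))  # True
```

If the statements were reversed (a read of `t` before its write), the same check would return `False`, matching the "does not need data privatization" branch of the claim.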
2. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1, wherein the privatization-favorable condition is that all loop-carried dependences can be eliminated by data privatization.
3. The method according to claim 1, wherein performing privatization processing on scalar-induced dependences according to the detection and judgment result comprises:
for a dependence caused by a scalar to be privatized, first traversing each statement in the loop body, finding and collecting the write operations on scalars, and adding them to a set W; if the set W is empty, no scalar privatization is needed; otherwise, traversing the set W and detecting whether each scalar involved in the set is a reduction variable, and if so, transforming the reduction variable into an induction variable; then traversing the set W and performing privatization processing on each scalar involved in the set.
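As an illustrative sketch only (the statement representation and names are hypothetical, not from the patent), the steps of this claim can be mocked up as: collect scalar write targets into a set W, flag reduction-style updates such as `s = s + x` for conversion, and mark the remaining scalars private.

```python
# Minimal sketch of the scalar-handling steps: build the write set W,
# detect reductions (target also appears among the operands), and plan
# the transformation for each scalar.

def plan_scalar_privatization(statements):
    """statements: list of (target, operands) pairs for scalar assignments."""
    W = {target for target, _ in statements}   # write operations on scalars
    if not W:
        return {}                              # nothing to privatize
    plan = {}
    for target, operands in statements:
        if target in operands:                 # s = s op x: reduction variable
            plan[target] = "reduction->induction"
        else:
            plan.setdefault(target, "private")
    return plan

stmts = [("s", ("s", "a[i]")),   # s = s + a[i]  (reduction)
         ("t", ("a[i]",))]       # t = a[i] * 2  (plain private scalar)
print(plan_scalar_privatization(stmts))
```

A real compiler pass would of course work on its intermediate representation rather than on tuples; the sketch only mirrors the order of operations the claim describes.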
4. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1 or 3, wherein a scalar is privatized by scalar expansion into an array or by adding a privatization clause.
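For illustration only (both loops below are hypothetical examples, not taken from the patent), scalar expansion replaces the single scalar with a per-iteration array element, so iterations no longer share storage and may be assigned to different threads:

```python
# Scalar expansion sketch: the scalar t carries output/anti dependences
# across iterations; expanding it into t_x gives each iteration its own
# storage while preserving the results.

a = [1, 2, 3, 4]
n = len(a)

# original: every iteration reuses the same scalar t
b = [0] * n
t = 0
for i in range(n):
    t = a[i] * 2
    b[i] = t + 1

# after scalar expansion: each iteration owns t_x[i]
b2 = [0] * n
t_x = [0] * n
for i in range(n):          # iterations are now independent
    t_x[i] = a[i] * 2
    b2[i] = t_x[i] + 1

print(b == b2)  # True: the transformation preserves the results
```

The alternative named in the claim, a privatization clause, would in an OpenMP setting correspond to something like `#pragma omp parallel for private(t)`, which keeps a per-thread rather than per-iteration copy.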
5. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1, wherein performing privatization processing on array-induced dependences according to the detection and judgment result comprises the following:
for a dependence caused by an array to be privatized, first traversing each array access statement in the loop body, finding and collecting the write operations on arrays in the loop body, and adding them to a set AW; if the set AW is empty, no array privatization is needed; otherwise, traversing the set AW and detecting whether each array reference involved in the set is a reduction, and if so, transforming the reduction into an induction; then traversing the set AW and performing privatization processing on each array involved in the set.
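As an illustrative sketch only (all names hypothetical, not from the patent), the array case arises when a work array is fully rewritten by each outer iteration before being read, so each iteration (or thread) may keep a private copy:

```python
# Array privatization sketch: tmp is completely defined by iteration i
# before any element is read, so the loop-carried dependence through tmp
# can be removed by privatizing tmp.

def outer_loop(a):
    n = len(a)
    out = [0] * n
    tmp = [0] * n                 # shared work array: loop-carried dependence
    for i in range(n):            # target loop layer
        for j in range(n):
            tmp[j] = a[i] * j     # every element written by iteration i ...
        out[i] = sum(tmp)         # ... before any element is read
    return out

def outer_loop_privatized(a):
    n = len(a)
    out = [0] * n
    for i in range(n):
        tmp = [a[i] * j for j in range(n)]   # private copy per iteration
        out[i] = sum(tmp)
    return out

a = [1, 2, 3]
print(outer_loop(a) == outer_loop_privatized(a))  # True
```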
6. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1 or 5, wherein an array is privatized by array dimension expansion or by adding a privatization clause.
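For illustration only (a hypothetical example, not from the patent), array dimension expansion adds a dimension indexed by the outer loop, so each outer iteration writes a disjoint row of the expanded array instead of sharing one buffer:

```python
# Array dimension expansion sketch: the one-dimensional work array tmp[j]
# becomes tmp2[i][j]; row i is owned exclusively by outer iteration i,
# which removes the shared storage between iterations.

a = [2, 5, 7]
n = len(a)

out = [0] * n
tmp2 = [[0] * n for _ in range(n)]   # expanded along the outer loop dim
for i in range(n):
    for j in range(n):
        tmp2[i][j] = a[i] + j        # row i is private to iteration i
    out[i] = max(tmp2[i])

print(out)  # [4, 7, 9]
```

In practice the expansion is often sized by the number of threads rather than the number of iterations, to bound the memory cost; the claim's alternative, a privatization clause (e.g. OpenMP `private(tmp)`), achieves the per-thread copy without changing the array's declared shape.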
7. A data optimization device for thread-level parallelism in a shared-memory multi-core structure, implemented on the basis of the data optimization method of claim 1, comprising a test module and an optimization module, wherein
the test module is configured to detect and judge whether a privatization-favorable condition exists for all loop-carried dependences of the data to be optimized, and to feed back the cases requiring privatization processing to the optimization module according to the detection and judgment result; and
the optimization module is configured to perform data privatization processing according to the feedback result of the test module.
8. The data optimization device for thread-level parallelism in a shared-memory multi-core structure according to claim 7, wherein the test module comprises a first detection submodule and a second detection submodule, wherein
the first detection submodule is configured, for the target loop layer of the data to be optimized, to determine that the dependence caused by a scalar requires privatization processing if every reference to the scalar in the target loop layer is assigned by the current iteration, and otherwise to determine that the scalar does not need data privatization processing; and
the second detection submodule is configured, for the target loop layer of the data to be optimized and for each array, to determine that the dependence caused by the array requires privatization processing if the array dependence is carried by a loop layer within the target loop layer and every reference to each array element within an iteration of the target loop layer is detected to be assigned by the current iteration, and otherwise to determine that the array does not need data privatization.
9. The data optimization device for thread-level parallelism in a shared-memory multi-core structure according to claim 7, wherein the optimization module comprises a first optimization submodule and a second optimization submodule, wherein
the first optimization submodule is configured, according to the feedback result and for a dependence caused by a scalar to be privatized, to traverse each statement in the loop body, find and collect the write operations on scalars, and add them to a set W; if the set W is empty, no scalar privatization is needed; otherwise, to traverse the set W, detect whether each scalar involved in the set is a reduction variable, and if so, transform the reduction variable into an induction variable; and then to traverse the set W and perform privatization processing on each scalar involved in the set; and
the second optimization submodule is configured, according to the feedback result and for a dependence caused by an array to be privatized, to traverse each array access statement in the loop body, find and collect the write operations on arrays in the loop body, and add them to a set AW; if the set AW is empty, no array privatization is needed; otherwise, to traverse the set AW, detect whether each array reference involved in the set is a reduction, and if so, transform the reduction into an induction; and then to traverse the set AW and perform privatization processing on each array involved in the set.
CN201811376636.0A 2018-11-19 2018-11-19 Thread-level parallel data optimization method and device in shared memory multi-core structure Expired - Fee Related CN109522126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811376636.0A CN109522126B (en) 2018-11-19 2018-11-19 Thread-level parallel data optimization method and device in shared memory multi-core structure

Publications (2)

Publication Number Publication Date
CN109522126A CN109522126A (en) 2019-03-26
CN109522126B true CN109522126B (en) 2020-04-24

Family

ID=65778244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811376636.0A Expired - Fee Related CN109522126B (en) 2018-11-19 2018-11-19 Thread-level parallel data optimization method and device in shared memory multi-core structure

Country Status (1)

Country Link
CN (1) CN109522126B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795106B (en) * 2019-10-30 2022-10-04 中国人民解放军战略支援部队信息工程大学 Dynamic and static combined memory alias analysis processing method and device in program vectorization process
CN113778518B (en) * 2021-08-31 2024-03-26 中科曙光国际信息产业有限公司 Data processing method, device, computer equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107438829A (en) * 2015-04-08 2017-12-05 华为技术有限公司 Partitioned storage data set redoes log record
CN107885531A (en) * 2017-11-28 2018-04-06 昆山青石计算机有限公司 A kind of concurrent big data real-time processing method based on array privatization

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN101944014B (en) * 2010-09-15 2013-08-21 复旦大学 Method for realizing automatic pipeline parallelism
CN105242929B (en) * 2015-10-13 2018-07-17 西安交通大学 A kind of design method of binary program automatically parallelizing for multi-core platform


Similar Documents

Publication Publication Date Title
US8104030B2 (en) Mechanism to restrict parallelization of loops
US8726251B2 (en) Pipelined loop parallelization with pre-computations
US8561046B2 (en) Pipelined parallelization with localized self-helper threading
Chen et al. Free launch: optimizing GPU dynamic kernel launches through thread reuse
Virouleau et al. Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite
US7458065B2 (en) Selection of spawning pairs for a speculative multithreaded processor
Tian et al. Speculative parallelization using state separation and multiple value prediction
US20090158248A1 (en) Compiler and Runtime for Heterogeneous Multiprocessor Systems
TWI733798B (en) An apparatus and method for managing address collisions when performing vector operations
US6892380B2 (en) Method for software pipelining of irregular conditional control loops
Chen et al. Register allocation for intel processor graphics
CN109522126B (en) Thread-level parallel data optimization method and device in shared memory multi-core structure
Samadi et al. Paragon: Collaborative speculative loop execution on gpu and cpu
Sanjuan-Estrada et al. Adaptive parallel interval branch and bound algorithms based on their performance for multicore architectures
Anantpur et al. Runtime dependence computation and execution of loops on heterogeneous systems
US9921838B2 (en) System and method for managing static divergence in a SIMD computing architecture
Kerr et al. Dynamic compilation of data-parallel kernels for vector processors
Breß et al. A framework for cost based optimization of hybrid CPU/GPU query plans in database systems
WO2022048191A1 (en) Method and apparatus for reusable and relative indexed register resource allocation in function calls
US9665354B2 (en) Apparatus and method for translating multithread program code
Barthou et al. SPAGHETtI: Scheduling/placement approach for task-graphs on HETerogeneous architecture
Berned et al. Combining thread throttling and mapping to optimize the edp of parallel applications
Puiggali et al. Dynamic branch speculation in a speculative parallelization architecture for computer clusters
Ashraf et al. Hybrid model based testing tool architecture for exascale computing system
Dheeraj et al. Optimization of automatic conversion of serial C to parallel OpenMP

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200424

Termination date: 20201119