CN109522126B - Thread-level parallel data optimization method and device in shared memory multi-core structure - Google Patents

Thread-level parallel data optimization method and device in shared memory multi-core structure

Info

Publication number
CN109522126B
CN109522126B (application CN201811376636.0A)
Authority
CN
China
Prior art keywords
privatization
array
data
scalar
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811376636.0A
Other languages
Chinese (zh)
Other versions
CN109522126A (en)
Inventor
徐金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201811376636.0A
Publication of CN109522126A
Application granted
Publication of CN109522126B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 — Indexing scheme relating to G06F 9/00
    • G06F 2209/50 — Indexing scheme relating to G06F 9/50
    • G06F 2209/5018 — Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention belongs to the technical field of parallel optimization for computer multi-core structures, and specifically relates to a data optimization method and device for thread-level parallelism in a shared memory multi-core structure. The method comprises the following steps: detecting and judging whether all loop-carried dependences of the data to be optimized satisfy the conditions under which privatization is profitable; and performing data privatization according to the detection result. For parallel loop threads on multi-core hardware, the invention first tests whether privatization of the data is profitable and then eliminates the loop-carried dependences through data privatization, thereby achieving multithreaded parallel execution of the program. Data privatization is further divided into scalar privatization and array privatization. The method fully exploits the advantages of the shared memory multi-core hardware structure, improves thread execution efficiency and the utilization of processor and overall system performance, reduces running overhead, and provides important guidance for thread-level parallel processing on shared memory multi-core structures.

Description

Thread-level parallel data optimization method and device in shared memory multi-core structure
Technical Field
The invention belongs to the technical field of parallel optimization for computer multi-core structures, and particularly relates to a data optimization method and device for thread-level parallelism in a shared memory multi-core structure.
Background
Currently, parallel architectures are no longer the privilege of supercomputers; modern processors commonly adopt multi-core designs to achieve higher performance. To fully exploit the hardware of such multi-core structures, efficient thread-level parallelism must be developed, and the OpenMP programming model is an effective means of achieving it. As shown in fig. 2, in the OpenMP execution model only a main thread runs at the beginning of execution. When the main thread encounters a region requiring parallel computation, several threads are forked (Fork: new threads are created or existing threads are woken up) to execute the parallel tasks. During parallel execution the main thread and the forked threads work together; after the parallel region finishes, the forked threads exit or suspend, and control flow returns to the single main thread (Join: the convergence of multiple threads). OpenMP conveniently achieves multithreaded parallelism of loops, i.e., assigning different loop iterations to different threads for execution. Because the execution order of the threads is arbitrary, the original iteration order is effectively shuffled, so no dependence may exist between iterations; such a dependence between iterations is called a loop-carried dependence in the following. Any loop-carried dependence prevents parallelization of the program, degrading thread execution efficiency and the running performance of the multi-core hardware while increasing overhead.
Disclosure of Invention
Therefore, the invention provides a data optimization method and device for thread-level parallelism in a shared memory multi-core structure, which fully exploit the hardware performance of the multi-core structure, satisfy the requirements of multithreaded loop parallelism, make thread-level parallel execution of loop data effective, and improve the data execution efficiency of a multi-core processor.
According to the design scheme provided by the invention, the data optimization method for thread-level parallelism in a shared memory multi-core structure comprises the following steps: for the loop to be optimized, detecting and judging whether all loop-carried dependences satisfy the conditions for profitable privatization; and performing data privatization according to the detection result.
In the above, the profitable-privatization condition refers to whether all loop-carried dependences can be eliminated by data privatization.
Detecting and judging whether privatization is profitable for all loop-carried dependences includes the following:
For the target loop level of the data to be optimized: for each scalar, if every reference in the target loop level is assigned by the current iteration, the dependence caused by the scalar is judged to require privatization; otherwise the scalar needs no data privatization. For each array, if the array dependence is carried by a loop level inside the target loop level, and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, the dependence caused by the array is judged to require privatization; otherwise the array needs no data privatization.
Preferably, privatization of scalar-induced dependences according to the detection result includes the following steps:
For a dependence caused by a scalar to be privatized, first traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W. If W is empty, no scalar privatization is needed; otherwise traverse W and detect whether each scalar involved is a reduction variable, and if so, transform the reduction variable into an induction variable. Finally, traverse W and privatize each scalar involved.
Further, each scalar is privatized either by scalar expansion into an array or by annotating it with a privatization clause.
Preferably, privatization of array-induced dependences according to the detection result includes the following:
For a dependence caused by an array to be privatized, first traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW. If AW is empty, no array privatization is needed; otherwise traverse AW and detect whether a reduction exists in each array reference involved, and if so, transform the reduction into an induction. Finally, traverse AW and privatize each array involved.
Further, each array is privatized either by array dimension expansion or by annotating it with a privatization clause.
A data optimization device for thread-level parallelism in a shared memory multi-core structure comprises a test module and an optimization module, wherein
the test module is used to detect and judge whether all loop-carried dependences of the data to be optimized satisfy the conditions for profitable privatization, and to feed back the cases requiring privatization to the optimization module according to the detection result;
and the optimization module is used to perform data privatization according to the feedback from the test module.
In the above device, the test module comprises a first detection submodule and a second detection submodule, wherein
the first detection submodule is used, for the target loop level of the data to be optimized, to judge for each scalar that the dependence caused by the scalar requires privatization if every reference in the target loop level is assigned by the current iteration, and otherwise that the scalar needs no data privatization;
and the second detection submodule is used, for the target loop level of the data to be optimized, to judge for each array that the dependence caused by the array requires privatization if the array dependence is carried by a loop level inside the target loop level and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, and otherwise that the array needs no data privatization.
In the above apparatus, the optimization module comprises a first optimization submodule and a second optimization submodule, wherein
the first optimization submodule is used, according to the feedback result and for each dependence caused by a scalar to be privatized, to traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W; if W is empty, no scalar privatization is needed, otherwise W is traversed to detect whether each scalar involved is a reduction variable and, if so, to transform the reduction variable into an induction variable; finally W is traversed and each scalar involved is privatized;
and the second optimization submodule is used, according to the feedback result and for each dependence caused by an array to be privatized, to traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW; if AW is empty, no array privatization is needed, otherwise AW is traversed to detect whether a reduction exists in each array reference involved and, if so, to transform the reduction into an induction; finally AW is traversed and each array involved is privatized.
The invention has the beneficial effects that:
aiming at the circulation parallel threads in the multi-core hardware structure, the invention firstly carries out the favorable detection of the privatization processing on the data, and then eliminates the circulation carrying dependence through the data privatization, thereby realizing the multi-thread parallel of the program; and the data privatization is further divided into scalar privatization and group privatization, the multi-core hardware structure advantage of shared storage is fully utilized, the thread execution efficiency is improved, the utilization rate of the performance of a processor and even a computer system is improved, the operation cost is reduced, and the method has important guiding significance on the shared memory multi-core structure thread-level parallel processing technology.
Description of the drawings:
FIG. 1 is a schematic flow chart of a data optimization method in an embodiment;
FIG. 2 is a diagram illustrating an OpenMP parallel execution model according to an embodiment;
FIG. 3 is a schematic diagram of a data optimization device in an embodiment;
FIG. 4 is a schematic diagram of a test module in an embodiment;
FIG. 5 is a schematic diagram of an optimization module in an embodiment.
Detailed description of embodiments:
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions.
As shown in fig. 2, in the OpenMP execution model only a main thread runs at the beginning of execution. When the main thread encounters a region requiring parallel computation, several threads are forked (Fork: new threads are created or existing threads are woken up) to execute the parallel tasks. During parallel execution the main thread and the forked threads work together; after the parallel region finishes, the forked threads exit or suspend, and control flow returns to the single main thread (Join: the convergence of multiple threads). Multithreaded loop parallelism can be conveniently implemented with OpenMP. In multithreaded loop parallelism the execution order of the threads is arbitrary, so no loop-carried dependence may exist between iterations; otherwise parallelization of the program is prevented, degrading the execution efficiency and running performance of the multi-core processor. In view of this, an embodiment of the present invention, as shown in fig. 1, provides a data optimization method for thread-level parallelism in a shared memory multi-core structure, including the following steps:
S101, for the data to be optimized, detecting and judging whether all loop-carried dependences satisfy the conditions for profitable privatization;
S102, performing data privatization according to the detection result.
In the embodiment of the invention, a privatization profitability test is first performed to judge whether all dependences hindering parallelism can be eliminated by privatization; then, according to the test result, the dependences are eliminated through data privatization, breaking the dependence relations between iterations, achieving thread-level parallelism of the loop, fully utilizing the hardware performance of the multi-core structure, and improving program execution efficiency.
When detecting and judging whether privatization is profitable for all loop-carried dependences, in another embodiment of the present invention, the detection process is designed to include the following:
For the target loop level of the data to be optimized: for each scalar, if every reference in the target loop level is assigned by the current iteration, the dependence caused by the scalar is judged to require privatization; otherwise the scalar needs no data privatization. For each array, if the array dependence is carried by a loop level inside the target loop level, and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, the dependence caused by the array is judged to require privatization; otherwise the array needs no data privatization.
Since data privatization consumes memory space, it does not always pay off. Therefore, before privatizing, it is detected whether all loop-carried dependences can be eliminated by data privatization; if so, the actual privatization proceeds, otherwise privatization is abandoned, avoiding unnecessary program execution overhead. For carried dependences caused by scalars, the loop level to be parallelized is called the target loop level; for each scalar, if every reference within an iteration of the target loop level reads a value computed by that same iteration, the dependence caused by the scalar can be eliminated by privatization, and the next profitability test continues. For each array dependence, the loop level carrying the dependence must be examined; again calling the loop level to be parallelized the target loop level, three cases arise: (1) the dependence is carried by a loop level outside the target loop level; (2) the dependence is carried by the target loop level itself; (3) the dependence is carried by a loop level inside the target loop level. In the first case the dependence is carried by the outer loop, the target loop level itself carries no dependence, and the array need not be privatized. In the second case the dependence cannot be eliminated by array privatization, and the profitability test fails. In the third case, if the value of every array element used in an iteration of the target loop level is computed by the current iteration, the dependence can be eliminated by privatization; otherwise it cannot, and the profitability test fails.
Privatization of scalar-induced dependences is performed according to the detection result; in another embodiment of the invention, the scalar privatization includes the following:
For a dependence caused by a scalar to be privatized, first traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W. If W is empty, no scalar privatization is needed; otherwise traverse W and detect whether each scalar involved is a reduction variable, and if so, transform the reduction variable into an induction variable. Finally, traverse W and privatize each scalar involved.
Furthermore, the scalar is privatized either by scalar expansion into an array or by annotating it with a privatization clause such as private.
Privatization of array-induced dependences is performed according to the detection result; in another embodiment of the present invention, it includes the following:
For a dependence caused by an array to be privatized, first traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW. If AW is empty, no array privatization is needed; otherwise traverse AW and detect whether a reduction exists in each array reference involved, and if so, transform the reduction into an induction. Finally, traverse AW and privatize each array involved.
Furthermore, the array is privatized either by array dimension expansion or by annotating it with a privatization clause such as private.
To detect whether an array reference is a reduction, one judges whether it has the form "ARRAY[i] = ARRAY[i] op XX"; if such an array reduction exists, it can be transformed into an induction. For each privatized array that is also read in the loop body, every related array element is initialized; the default initial value of a privatized array is 0, which avoids inconsistencies with the original values that would otherwise affect program execution.
Based on the above data optimization method, an embodiment of the present invention further provides a data optimization apparatus for thread-level parallelism in a shared memory multi-core structure, as shown in fig. 3, including a testing module 101 and an optimizing module 102, wherein,
the test module 101 is used to detect and judge whether all loop-carried dependences of the data to be optimized satisfy the conditions for profitable privatization, and to feed back the cases requiring privatization to the optimization module according to the detection result;
and the optimization module 102 is used to perform data privatization according to the feedback from the test module.
In the above-described apparatus, referring to fig. 4, the test module 101 comprises a first detection submodule 1001 and a second detection submodule 1002, wherein
the first detection submodule 1001 is used, for the target loop level of the data to be optimized, to judge for each scalar that the dependence caused by the scalar requires privatization if every reference in the target loop level is assigned by the current iteration, and otherwise that the scalar needs no data privatization;
and the second detection submodule 1002 is used, for the target loop level of the data to be optimized, to judge for each array that the dependence caused by the array requires privatization if the array dependence is carried by a loop level inside the target loop level and every reference to every array element within an iteration of the target loop level is detected to be assigned by the current iteration, and otherwise that the array needs no data privatization.
In the above-described apparatus, referring to fig. 5, the optimization module 102 comprises a first optimization submodule 2001 and a second optimization submodule 2002, wherein
the first optimization submodule 2001 is used, according to the feedback result and for each dependence caused by a scalar to be privatized, to traverse every statement in the loop body, find and collect the write operations targeting scalars, and add them to a set W; if W is empty, no scalar privatization is needed, otherwise W is traversed to detect whether each scalar involved is a reduction variable and, if so, to transform the reduction variable into an induction variable; finally W is traversed and each scalar involved is privatized;
and the second optimization submodule 2002 is used, according to the feedback result and for each dependence caused by an array to be privatized, to traverse every array access statement in the loop body, find and collect the write operations targeting arrays, and add them to a set AW; if AW is empty, no array privatization is needed, otherwise AW is traversed to detect whether a reduction exists in each array reference involved and, if so, to transform the reduction into an induction; finally AW is traversed and each array involved is privatized.
Based on the above, the embodiments of the present invention are further described by means of pseudocode and specific examples; the pseudocode designs and examples for scalar and array privatization are as follows:
Scalar privatization pseudocode and a scalar privatization instance are given as images (Figure BDA0001870909460000081 and Figure BDA0001870909460000082) in the original publication and are not reproduced in this text.
Array privatization pseudocode and an array privatization instance are given as images (Figure BDA0001870909460000091 and Figure BDA0001870909460000101) in the original publication and are not reproduced in this text.
based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above method, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.
Unless specifically stated otherwise, the relative arrangement of steps, the numerical expressions, and the values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The device provided by the embodiments of the present invention has the same implementation principle and technical effects as the method embodiments; for brevity, where the device embodiments omit details, reference may be made to the corresponding content in the method embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A data optimization method for thread-level parallelism in a shared-memory multi-core structure, characterized by comprising the following steps:
for the loop to be optimized, detecting and judging whether a privatization-favorable condition exists for all loop-carried dependences, and performing data privatization processing according to the detection and judgment result;
wherein detecting and judging the privatization-favorable condition for all loop-carried dependences comprises the following:
for the target loop layer of the data to be optimized, if every reference to a scalar in the target loop layer is assigned by the current iteration, determining that the dependence caused by the scalar requires privatization processing, and otherwise determining that the scalar does not need data privatization processing; for each array, if the array dependence is carried by a loop layer within the target loop layer, and every reference to each array element within an iteration of the target loop layer is detected to be assigned by the current iteration, determining that the dependence caused by the array requires privatization processing, and otherwise determining that the array does not need data privatization.
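By way of illustration only (this sketch is not part of the claims, and the data model is hypothetical), the detection rule above can be expressed as a small check over a loop body described as an ordered list of per-statement write/read sets: a scalar is privatizable only if every read of it is preceded, within the same iteration, by a write.

```python
# Toy sketch of the scalar detection rule (illustrative names, not from the
# patent): a scalar may be privatized if it is assigned by the current
# iteration before any use, so no value flows between iterations.

def scalar_is_privatizable(body, scalar):
    """body: ordered list of (writes, reads) set pairs, one per statement."""
    written = False
    for writes, reads in body:
        if scalar in reads and not written:
            return False          # reads a value from a previous iteration
        if scalar in writes:
            written = True        # assigned by the current iteration
    return written                # must actually be written in the loop

# t = a[i] * 2; b[i] = t + 1  ->  t is written before it is read
body = [({"t"}, {"a"}), ({"b"}, {"t"})]
print(scalar_is_privatizable(body, "t"))  # True
```

If the statements were reversed (a read of `t` before its write), the same check would return `False`, matching the "does not need data privatization" branch of the claim.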
2. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1, wherein the privatization-favorable condition is that all loop-carried dependences can be eliminated by data privatization.
3. The method according to claim 1, wherein performing privatization processing on scalar-induced dependences according to the detection and judgment result comprises:
for a dependence caused by a scalar to be privatized, first traversing each statement in the loop body, finding and collecting the write operations on scalars, and adding them to a set W; if the set W is empty, no scalar privatization is needed; otherwise, traversing the set W and detecting whether each scalar involved in the set is a reduction variable, and if so, transforming the reduction variable into an induction variable; then traversing the set W and performing privatization processing on each scalar involved in the set.
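As an illustrative sketch only (the statement representation and names are hypothetical, not from the patent), the steps of this claim can be mocked up as: collect scalar write targets into a set W, flag reduction-style updates such as `s = s + x` for conversion, and mark the remaining scalars private.

```python
# Minimal sketch of the scalar-handling steps: build the write set W,
# detect reductions (target also appears among the operands), and plan
# the transformation for each scalar.

def plan_scalar_privatization(statements):
    """statements: list of (target, operands) pairs for scalar assignments."""
    W = {target for target, _ in statements}   # write operations on scalars
    if not W:
        return {}                              # nothing to privatize
    plan = {}
    for target, operands in statements:
        if target in operands:                 # s = s op x: reduction variable
            plan[target] = "reduction->induction"
        else:
            plan.setdefault(target, "private")
    return plan

stmts = [("s", ("s", "a[i]")),   # s = s + a[i]  (reduction)
         ("t", ("a[i]",))]       # t = a[i] * 2  (plain private scalar)
print(plan_scalar_privatization(stmts))
```

A real compiler pass would of course work on its intermediate representation rather than on tuples; the sketch only mirrors the order of operations the claim describes.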
4. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1 or 3, wherein a scalar is privatized by scalar expansion into an array or by adding a privatization clause.
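For illustration only (both loops below are hypothetical examples, not taken from the patent), scalar expansion replaces the single scalar with a per-iteration array element, so iterations no longer share storage and may be assigned to different threads:

```python
# Scalar expansion sketch: the scalar t carries output/anti dependences
# across iterations; expanding it into t_x gives each iteration its own
# storage while preserving the results.

a = [1, 2, 3, 4]
n = len(a)

# original: every iteration reuses the same scalar t
b = [0] * n
t = 0
for i in range(n):
    t = a[i] * 2
    b[i] = t + 1

# after scalar expansion: each iteration owns t_x[i]
b2 = [0] * n
t_x = [0] * n
for i in range(n):          # iterations are now independent
    t_x[i] = a[i] * 2
    b2[i] = t_x[i] + 1

print(b == b2)  # True: the transformation preserves the results
```

The alternative named in the claim, a privatization clause, would in an OpenMP setting correspond to something like `#pragma omp parallel for private(t)`, which keeps a per-thread rather than per-iteration copy.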
5. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1, wherein performing privatization processing on array-induced dependences according to the detection and judgment result comprises the following:
for a dependence caused by an array to be privatized, first traversing each array access statement in the loop body, finding and collecting the write operations on arrays in the loop body, and adding them to a set AW; if the set AW is empty, no array privatization is needed; otherwise, traversing the set AW and detecting whether each array reference involved in the set is a reduction, and if so, transforming the reduction into an induction; then traversing the set AW and performing privatization processing on each array involved in the set.
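As an illustrative sketch only (all names hypothetical, not from the patent), the array case arises when a work array is fully rewritten by each outer iteration before being read, so each iteration (or thread) may keep a private copy:

```python
# Array privatization sketch: tmp is completely defined by iteration i
# before any element is read, so the loop-carried dependence through tmp
# can be removed by privatizing tmp.

def outer_loop(a):
    n = len(a)
    out = [0] * n
    tmp = [0] * n                 # shared work array: loop-carried dependence
    for i in range(n):            # target loop layer
        for j in range(n):
            tmp[j] = a[i] * j     # every element written by iteration i ...
        out[i] = sum(tmp)         # ... before any element is read
    return out

def outer_loop_privatized(a):
    n = len(a)
    out = [0] * n
    for i in range(n):
        tmp = [a[i] * j for j in range(n)]   # private copy per iteration
        out[i] = sum(tmp)
    return out

a = [1, 2, 3]
print(outer_loop(a) == outer_loop_privatized(a))  # True
```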
6. The data optimization method for thread-level parallelism in a shared-memory multi-core structure according to claim 1 or 5, wherein an array is privatized by array dimension expansion or by adding a privatization clause.
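For illustration only (a hypothetical example, not from the patent), array dimension expansion adds a dimension indexed by the outer loop, so each outer iteration writes a disjoint row of the expanded array instead of sharing one buffer:

```python
# Array dimension expansion sketch: the one-dimensional work array tmp[j]
# becomes tmp2[i][j]; row i is owned exclusively by outer iteration i,
# which removes the shared storage between iterations.

a = [2, 5, 7]
n = len(a)

out = [0] * n
tmp2 = [[0] * n for _ in range(n)]   # expanded along the outer loop dim
for i in range(n):
    for j in range(n):
        tmp2[i][j] = a[i] + j        # row i is private to iteration i
    out[i] = max(tmp2[i])

print(out)  # [4, 7, 9]
```

In practice the expansion is often sized by the number of threads rather than the number of iterations, to bound the memory cost; the claim's alternative, a privatization clause (e.g. OpenMP `private(tmp)`), achieves the per-thread copy without changing the array's declared shape.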
7. A data optimization device for thread-level parallelism in a shared-memory multi-core structure, implemented on the basis of the data optimization method of claim 1, comprising a test module and an optimization module, wherein
the test module is configured to detect and judge whether a privatization-favorable condition exists for all loop-carried dependences of the data to be optimized, and to feed back the cases requiring privatization processing to the optimization module according to the detection and judgment result; and
the optimization module is configured to perform data privatization processing according to the feedback result of the test module.
8. The data optimization device for thread-level parallelism in a shared-memory multi-core structure according to claim 7, wherein the test module comprises a first detection submodule and a second detection submodule, wherein
the first detection submodule is configured, for the target loop layer of the data to be optimized, to determine that the dependence caused by a scalar requires privatization processing if every reference to the scalar in the target loop layer is assigned by the current iteration, and otherwise to determine that the scalar does not need data privatization processing; and
the second detection submodule is configured, for the target loop layer of the data to be optimized and for each array, to determine that the dependence caused by the array requires privatization processing if the array dependence is carried by a loop layer within the target loop layer and every reference to each array element within an iteration of the target loop layer is detected to be assigned by the current iteration, and otherwise to determine that the array does not need data privatization.
9. The data optimization device for thread-level parallelism in a shared-memory multi-core structure according to claim 7, wherein the optimization module comprises a first optimization submodule and a second optimization submodule, wherein
the first optimization submodule is configured, according to the feedback result and for a dependence caused by a scalar to be privatized, to traverse each statement in the loop body, find and collect the write operations on scalars, and add them to a set W; if the set W is empty, no scalar privatization is needed; otherwise, to traverse the set W, detect whether each scalar involved in the set is a reduction variable, and if so, transform the reduction variable into an induction variable; and then to traverse the set W and perform privatization processing on each scalar involved in the set; and
the second optimization submodule is configured, according to the feedback result and for a dependence caused by an array to be privatized, to traverse each array access statement in the loop body, find and collect the write operations on arrays in the loop body, and add them to a set AW; if the set AW is empty, no array privatization is needed; otherwise, to traverse the set AW, detect whether each array reference involved in the set is a reduction, and if so, transform the reduction into an induction; and then to traverse the set AW and perform privatization processing on each array involved in the set.
CN201811376636.0A 2018-11-19 2018-11-19 Thread-level parallel data optimization method and device in shared memory multi-core structure Expired - Fee Related CN109522126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811376636.0A CN109522126B (en) 2018-11-19 2018-11-19 Thread-level parallel data optimization method and device in shared memory multi-core structure

Publications (2)

Publication Number Publication Date
CN109522126A CN109522126A (en) 2019-03-26
CN109522126B true CN109522126B (en) 2020-04-24

Family

ID=65778244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811376636.0A Expired - Fee Related CN109522126B (en) 2018-11-19 2018-11-19 Thread-level parallel data optimization method and device in shared memory multi-core structure

Country Status (1)

Country Link
CN (1) CN109522126B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795106B (en) * 2019-10-30 2022-10-04 中国人民解放军战略支援部队信息工程大学 Dynamic and static combined memory alias analysis processing method and device in program vectorization process
CN113778518B (en) * 2021-08-31 2024-03-26 中科曙光国际信息产业有限公司 Data processing method, device, computer equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107438829A (en) * 2015-04-08 2017-12-05 华为技术有限公司 Partitioned storage data set redoes log record
CN107885531A (en) * 2017-11-28 2018-04-06 昆山青石计算机有限公司 A kind of concurrent big data real-time processing method based on array privatization

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN101944014B (en) * 2010-09-15 2013-08-21 复旦大学 Method for realizing automatic pipeline parallelism
CN105242929B (en) * 2015-10-13 2018-07-17 西安交通大学 A kind of design method of binary program automatically parallelizing for multi-core platform


Similar Documents

Publication Publication Date Title
US8104030B2 (en) Mechanism to restrict parallelization of loops
US8726251B2 (en) Pipelined loop parallelization with pre-computations
US8561046B2 (en) Pipelined parallelization with localized self-helper threading
Chen et al. Free launch: optimizing GPU dynamic kernel launches through thread reuse
Virouleau et al. Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite
US7458065B2 (en) Selection of spawning pairs for a speculative multithreaded processor
Tian et al. Speculative parallelization using state separation and multiple value prediction
US20090158248A1 (en) Compiler and Runtime for Heterogeneous Multiprocessor Systems
TWI733798B (en) An apparatus and method for managing address collisions when performing vector operations
US6892380B2 (en) Method for software pipelining of irregular conditional control loops
Chen et al. Register allocation for intel processor graphics
CN109522126B (en) Thread-level parallel data optimization method and device in shared memory multi-core structure
Samadi et al. Paragon: Collaborative speculative loop execution on gpu and cpu
Sanjuan-Estrada et al. Adaptive parallel interval branch and bound algorithms based on their performance for multicore architectures
Anantpur et al. Runtime dependence computation and execution of loops on heterogeneous systems
US9921838B2 (en) System and method for managing static divergence in a SIMD computing architecture
Kerr et al. Dynamic compilation of data-parallel kernels for vector processors
Breß et al. A framework for cost based optimization of hybrid CPU/GPU query plans in database systems
WO2022048191A1 (en) Method and apparatus for reusable and relative indexed register resource allocation in function calls
US9665354B2 (en) Apparatus and method for translating multithread program code
Barthou et al. SPAGHETtI: Scheduling/placement approach for task-graphs on HETerogeneous architecture
Berned et al. Combining thread throttling and mapping to optimize the edp of parallel applications
Puiggali et al. Dynamic branch speculation in a speculative parallelization architecture for computer clusters
Ashraf et al. Hybrid model based testing tool architecture for exascale computing system
Dheeraj et al. Optimization of automatic conversion of serial C to parallel OpenMP

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200424

Termination date: 20201119