CN109522126A

CN109522126A - Data optimization methods and device towards Thread-Level Parallelism in shared drive coenocytism

Info

Publication number: CN109522126A
Application number: CN201811376636.0A
Authority: CN
Inventors: 徐金龙
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2018-11-19
Filing date: 2018-11-19
Publication date: 2019-03-26
Anticipated expiration: 2038-11-19
Also published as: CN109522126B

Abstract

The invention belongs to computer coenocytism parallel optimization technique fields, in particular to a kind of data optimization methods and device towards Thread-Level Parallelism in shared drive coenocytism, this method includes: being directed to data to be optimized, detecting and determining that its all circulation carries to rely on whether there is the advantageous implementations of privatization；And determined according to detection as a result, implementing data privatization processing.The present invention carries out the profitable of privatization processing to data first and detects for parallel thread is recycled in multicore hardware configuration, and circulation is then eliminated by data privatization and carries the multi-threaded parallel for relying on, and then realizing program；And scalar privatization and array privatization further are divided into data privatization, make full use of the multicore hardware configuration advantage of shared storage, improve thread execution efficiency, and improve the utilization rate of processor or even computer system performance, operation expense is reduced, there is important directive significance to shared drive coenocytism Thread-Level Parallelism processing technique.

Description

Data optimization methods and device towards Thread-Level Parallelism in shared drive coenocytism

Technical field

The invention belongs to computer coenocytism parallel optimization technique fields, in particular to one kind is towards shared drive multicore The data optimization methods and device of Thread-Level Parallelism in structure.

Background technique

Currently, parallel architecture has not been the privilege of supercomputer, and modern processors generally use coenocytism To obtain higher performance.In order to make full use of the hardware performance of this coenocytism, need to develop effective Thread-Level Parallelism Program.OpenMP programming model is the effective means of Thread-Level Parallelism.The execution model of OpenMP is as shown in Fig. 2, starting to hold When row, only main thread is being run, and main thread in the process of running, when encountering the region for needing to carry out parallel computation, is sent (Fork creates new thread or wakes up existing thread) multiple threads are born to execute parallel task, when parallel execute, Main thread and derivation thread work together, and after parallel end of extent execution, derive from thread and exit or hang up, no longer work, control Process processed returns in individual main thread (Join, i.e. multithreading converge).Circulation is realized with can be convenient by OpenMP Multi-threaded parallel, that is, different loop iterations is assigned to different threads and is executed up.Sequence is executed due to per thread Be it is random, be equal to upset original iteration execute sequence, this requires that cannot have any dependence between each iteration, The subsequent dependence by between iteration, which is referred to as to recycle to carry, to be relied on.Any circulation, which carries the presence relied on, will prevent the parallel of program Change, and then influences thread execution efficiency, coenocytism hardware runnability and expense.

Summary of the invention

For this purpose, the present invention provides a kind of data optimization methods and dress towards Thread-Level Parallelism in shared drive coenocytism It sets, makes full use of the hardware performance of coenocytism, meet the multi-threaded parallel demand of circulation, realize the thread-level of loop-around data simultaneously The validity that row executes improves multi-core processor data execution efficiency.

According to design scheme provided by the present invention, a kind of data towards Thread-Level Parallelism in shared drive coenocytism Optimization method includes: being directed to circulation to be optimized, detecting and determining that its all circulation carries to rely on whether there is privatization profitability Situation；And determined according to detection as a result, implementing data privatization processing.

Above-mentioned, can the advantageous implementations of privatization refer to that all circulations carry and rely on and be carried out by data privatization It eliminates.

Above-mentioned, it detects and determines that all circulations carry the advantageous implementations of privatization relied on, include following content:

For the target circulation layer of data to be optimized, to wherein each scalar, if each in the target circulation layer Reference carries out assignment by current iteration, then determines to rely on privatization processing caused by the scalar, otherwise, it is determined that the scalar is not necessarily to Data privatization processing；To wherein each array, if the array rely on be carried by the circulation layer inside target circulation layer, and Each reference of each array element in target circulation stacking generation is detected by current iteration assignment, then determines that the array is drawn The dependence privatization processing risen, otherwise, it is determined that the array is not necessarily to data privatization.

Preferably, foundation detection determines as a result, dependence privatization processing caused by progress scalar, includes following content:

For dependence caused by the scalar handled to privatization, first every sentence in traversal loop body, searches and collect It is wherein directed to the write operation of the scalar, and is added it in set W, if set W is sky, without carrying out scalar privatization, Otherwise, traversal set W detects whether it is specification variable, if so, by specification for each scalar being related in set Change of variable is induction variable；Set W is traversed, privatization processing is carried out to each scalar being related in set.

Further, array or privatization clause modification are extended using scalar to carry out privatization processing to scalar.

Preferably, foundation detection determines as a result, dependence privatization processing caused by progress array, includes following content:

For dependence caused by the array handled to privatization, array of traversal loop weight every first accesses sentence, The write operation for being wherein directed to array is searched and collected, is added it in set AW；If set AW is sky, without carrying out array Otherwise privatization traverses set AW, for each array reference being related in set, detect it with the presence or absence of specification, if depositing It is being then to conclude by reduction transformation；Set AW is traversed, privatization processing is carried out to each array being related in set.

Further, privatization processing is carried out to array using array dimension extension or privatization clause modification.

A kind of data-optimized device towards Thread-Level Parallelism in shared drive coenocytism includes test module and optimization Module, wherein

Test module, for being directed to data to be optimized, detecting and determining that its all circulation carries dependence with the presence or absence of privately owned Change advantageous implementations, determines that the situation for needing privatization to handle is fed back to optimization module by result according to detection；

Optimization module, for implementing data privatization processing according to test module feedback result.

In above-mentioned device, test module includes detection sub-module one and detection sub-module two, wherein

Detection sub-module one, for being directed to the target circulation layer of data to be optimized, to wherein each scalar, if in the mesh The each reference marked in circulation layer carries out assignment by current iteration, then determines to rely on privatization processing caused by the scalar, no Then, determine that the scalar is handled without data privatization；

Detection sub-module two, for being directed to the target circulation layer of data to be optimized, to wherein each array, if the array Dependence is to be carried by the circulation layer inside target circulation layer, and it is every in target circulation stacking generation to detect each array element Secondary reference is then determined to rely on privatization processing caused by the array, otherwise, it is determined that the array is without number by current iteration assignment According to privatization.

In above-mentioned device, optimization module includes optimization submodule one and optimization submodule two, wherein

Optimize submodule one, for according to feedback result, for dependence caused by the scalar handled to privatization, first time Every sentence in loop body is gone through, search and collects the write operation for being wherein directed to the scalar, and is added it in set W, if collection Closing W is sky, then without carrying out scalar privatization, otherwise, traverses set W for each scalar being related in set and detect it It whether is specification variable, if so, being induction variable by specification change of variable；Set W is traversed, to each of being related in set Scalar carries out privatization processing；

Optimize submodule two, for according to feedback result, for dependence caused by the array handled to privatization, first time The circulation array of weight every access sentence is gone through, searches and collects the write operation for being wherein directed to array, add it to set AW In；If set AW is sky, without carrying out array privatization, otherwise, set AW is traversed, for the every number being related in set Group reference detects it with the presence or absence of specification, and if it exists, is then to conclude by reduction transformation；Set AW is traversed, to being related in set Each array carry out privatization processing.

Beneficial effects of the present invention:

The present invention carries out the profitability of privatization processing for parallel thread is recycled in multicore hardware configuration to data first Then detection is eliminated circulation by data privatization and carries the multi-threaded parallel for relying on, and then realizing program；And further will Scalar privatization and array privatization are divided into data privatization, are made full use of the multicore hardware configuration advantage of shared storage, are mentioned High thread execution efficiency, and the utilization rate of processor or even computer system performance is improved, operation expense is reduced, to shared drive Coenocytism Thread-Level Parallelism processing technique has important directive significance.

Detailed description of the invention:

Fig. 1 is data optimization methods flow diagram in embodiment；

Fig. 2 is OpenMP Parallel Execution Model schematic diagram in embodiment；

Fig. 3 is data-optimized schematic device in embodiment；

Fig. 4 is test module schematic diagram in embodiment；

Fig. 5 is optimization module schematic diagram in embodiment.

Specific embodiment:

To make the object, technical solutions and advantages of the present invention clearer, understand, with reference to the accompanying drawing with technical solution pair The present invention is described in further detail.

The execution model of OpenMP is as shown in Fig. 2, when starting to execute, and only main thread is being run, and main thread exists In operational process, when encountering the region for needing to carry out parallel computation, derive (Fork, create new thread or wake up it is wired Journey) multiple threads execute parallel task, and it when parallel execute, main thread and derives from thread and works together, in parallel area Terminate execute after, derive from thread exit or hang up, no longer work, control flow return in individual main thread (Join, i.e., it is more Thread converges).The multi-threaded parallel of circulation is realized by OpenMP with can be convenient.In circulation multi-threaded parallel, due to Per thread execution sequence be it is random, carry this requires cannot have circulation between each iteration and rely on, otherwise can prevent journey The parallelization of sequence, and then the runnability for influencing execution efficiency and multi-core processor etc..In consideration of it, the embodiment of the present invention, referring to Shown in Fig. 1, a kind of data optimization methods towards Thread-Level Parallelism in shared drive coenocytism are provided, include following content:

S101, data to be optimized are directed to, detecting and determining that its all circulation carries to rely on whether there is privatization profitability Situation；

S102, determined according to detection as a result, implementing data privatization processing.

In the embodiment of the present invention, whether the test of progress privatization profitability is first can eliminate institute when judging privatization Have and hinder parallel dependence, then implements to rely on by data privatization for test result and eliminate work, abolish between iteration Dependence, realize the Thread-Level Parallelism of circulation, make full use of the hardware runnability of coenocytism, improve program and execute effect Rate.

When detecting and determining that all circulations carry the advantageous implementations of privatizations relied on, in further embodiment of the present invention, The detection decision process is designed as comprising following content:

Because data privatization can consume memory headroom, the effect that might not have been brought.Therefore data privatization it Before, it detects whether that all circulations carry and relies on and can be eliminated by data privatization；If it can, so continuing to implement Otherwise real data privatization stops data privatization processing, avoids unnecessary program executive overhead.Test scalar draws The carrying risen relies on test, the circulation layer to parallelization is known as target circulation layer, for each scalar, if followed in target Quoting in circular layer iteration each time is all the value calculated by current iteration, then relying on caused by declared amount can lead to Privatization is crossed to eliminate, and continues profitable test in next step.For each array rely on, need to check carrying according to Circulation layer to parallelization is known as target circulation layer by bad circulation layer, and there are three kinds of situations: (1) relying on is by target circulation What the circulation layer except layer carried；(2) rely on is carried by target circulation layer；(3) relying on is by inside target circulation layer What circulation layer carried；For the first case, it is carried due to relying on by outer loop, target circulation layer is relied on without carrying, array Privatization is not needed；For second situation, dependence is carried by target circulation layer, if by the dependence followed by target What circular layer carried, then the dependence can not be eliminated by array privatization, profitable test crash；For the third situation, If it is what is carried by the circulation layer inside target circulation layer, each array element is continued to test, if it is in target circulation The value quoted each time in stacking generation is calculated by current iteration, then the dependence can be disappeared by privatization It removes, otherwise, can not eliminate, profitable test crash.

According to detection judgement as a result, carrying out relying on privatization processing caused by scalar, in another embodiment of the present invention, it is somebody's turn to do Scalar privatization processing includes following content:

Further, array is extended using scalar or privatization clause modifies, such as private, scalar carried out private Having processing.

According to detection judgement as a result, carrying out relying on privatization processing caused by array, in another embodiment of the present invention, it is somebody's turn to do Array privatization processing includes following content:

Further, it is modified, such as private, array is carried out private using array dimension extension or privatization clause Having processing.

When array detects whether as specification, can by determine its whether be " ARRAY [i]=ARRAY [i] op XX " shape Reduction transformation is to conclude if there is specification caused by array by formula.For the array of each privatization, if array element Read operation is only existed in loop body, then each array element being related to is initialized, and the array of each privatization is silent Recognizing value is all 0, the situation for avoiding may not being inconsistent with initial value etc., and then influences executing efficiency.

Based on above-mentioned data optimization methods, the embodiment of the present invention also provides one kind towards shared drive coenocytism middle line The parallel data-optimized device of journey grade, it is shown in Figure 3, it include test module 101 and optimization module 102, wherein

Test module 101, for being directed to data to be optimized, detecting and determining that its all circulation carries to rely on whether there is private The advantageous implementations of having determine that the situation for needing privatization to handle is fed back to optimization module by result according to detection；

Optimization module 102, for implementing data privatization processing according to test module feedback result.

Shown in Figure 4 in above-mentioned device, test module 101 includes detection sub-module 1 and detection sub-module 2 1002, wherein

Detection sub-module 1, for being directed to the target circulation layer of data to be optimized, to wherein each scalar, if Each reference in the target circulation layer carries out assignment by current iteration, then determines to rely at privatization caused by the scalar Reason, otherwise, it is determined that the scalar is handled without data privatization；

Detection sub-module 2 1002, for being directed to the target circulation layer of data to be optimized, to wherein each array, if should It is to be carried by the circulation layer inside target circulation layer, and detect each array element in target circulation stacking generation that array, which relies on, Each reference by current iteration assignment, then determine to rely on privatization processing caused by the array, otherwise, it is determined that the array without Need data privatization.

Shown in Figure 5 in above-mentioned device, optimization module 102 includes optimization submodule 1 and optimization submodule 2 2002, wherein

Optimize submodule 1, it is first for dependence caused by the scalar handled to privatization for foundation feedback result Every sentence in first traversal loop body is searched and collects the write operation for being wherein directed to the scalar, and adds it in set W, If set W is sky, without carrying out scalar privatization, otherwise, set W is traversed, for each scalar being related in set, inspection Survey whether it is specification variable, if so, being induction variable by specification change of variable；Set W is traversed, to what is be related in set Each scalar carries out privatization processing；

Optimize submodule 2 2002, it is first for dependence caused by the array handled to privatization for foundation feedback result First array of traversal loop weight every accesses sentence, searches and collects the write operation for being wherein directed to array, add it to collection It closes in AW；If set AW is sky, without carrying out array privatization, otherwise, set AW is traversed, it is every for what is be related in set A array reference detects it with the presence or absence of specification, and if it exists, is then to conclude by reduction transformation；Set AW is traversed, to relating in set And each array arrived carries out privatization processing.

Based on the above content, in the embodiment of the present invention, further by can refer to pseudocode content and specific example do into One step explanation, scalar, array privatization processing pseudocode design content and example are as follows:

Scalar privatization pseudocode:

Scalar privatization example:

Array privatization pseudocode:

Array privatization example is as follows:

Based on above-mentioned method, the embodiment of the present invention also provides a kind of server, comprising: one or more processors；It deposits Storage device, for storing one or more programs, when one or more of programs are executed by one or more of processors, So that one or more of processors realize above-mentioned method.

Based on above-mentioned method, the embodiment of the present invention also provides a kind of computer-readable medium, is stored thereon with computer Program, wherein the program realizes above-mentioned method when being executed by processor.

Unless specifically stated otherwise, the opposite step of the component and step that otherwise illustrate in these embodiments, digital table It is not limit the scope of the invention up to formula and numerical value.

The technical effect and preceding method embodiment phase of device provided by the embodiment of the present invention, realization principle and generation Together, to briefly describe, Installation practice part does not refer to place, can refer to corresponding contents in preceding method embodiment.

It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description It with the specific work process of device, can refer to corresponding processes in the foregoing method embodiment, details are not described herein.

In all examples being illustrated and described herein, any occurrence should be construed as merely illustratively, without It is as limitation, therefore, other examples of exemplary embodiment can have different values.

It should also be noted that similar label and letter indicate similar terms in following attached drawing, therefore, once a certain Xiang Yi It is defined in a attached drawing, does not then need that it is further defined and explained in subsequent attached drawing.

The flow chart and block diagram in the drawings show the system of multiple embodiments according to the present invention, method and computer journeys The architecture, function and operation in the cards of sequence product.In this regard, each box in flowchart or block diagram can generation A part of one module, section or code of table, a part of the module, section or code include one or more use The executable instruction of the logic function as defined in realizing.It should also be noted that in some implementations as replacements, being marked in box The function of note can also occur in a different order than that indicated in the drawings.For example, two continuous boxes can actually base Originally it is performed in parallel, they can also be executed in the opposite order sometimes, and this depends on the function involved.It is also noted that It is the combination of each box in block diagram and or flow chart and the box in block diagram and or flow chart, can uses and execute rule The dedicated hardware based system of fixed function or movement is realized, or can use the group of specialized hardware and computer instruction It closes to realize.

In several embodiments provided herein, it should be understood that disclosed systems, devices and methods, it can be with It realizes by another way.The apparatus embodiments described above are merely exemplary, for example, the division of the unit, Only a kind of logical function partition, there may be another division manner in actual implementation, in another example, multiple units or components can To combine or be desirably integrated into another system, or some features can be ignored or not executed.Another point, it is shown or beg for The mutual coupling, direct-coupling or communication connection of opinion can be through some communication interfaces, device or unit it is indirect Coupling or communication connection can be electrical property, mechanical or other forms.

The unit as illustrated by the separation member may or may not be physically separated, aobvious as unit The component shown may or may not be physical unit, it can and it is in one place, or may be distributed over multiple In network unit.It can select some or all of unit therein according to the actual needs to realize the mesh of this embodiment scheme 's.

It, can also be in addition, the functional units in various embodiments of the present invention may be integrated into one processing unit It is that each unit physically exists alone, can also be integrated in one unit with two or more units.

It, can be with if the function is realized in the form of SFU software functional unit and when sold or used as an independent product It is stored in the executable non-volatile computer-readable storage medium of a processor.Based on this understanding, of the invention Technical solution substantially the part of the part that contributes to existing technology or the technical solution can be with software in other words The form of product embodies, which is stored in a storage medium, including some instructions use so that One computer equipment (can be personal computer, server or the network equipment etc.) executes each embodiment institute of the present invention State all or part of the steps of method.And storage medium above-mentioned includes: USB flash disk, mobile hard disk, read-only memory (ROM, Read- Only Memory), random access memory (RAM, Random Access Memory), magnetic or disk etc. are various can be with Store the medium of program code.

Finally, it should be noted that embodiment described above, only a specific embodiment of the invention, to illustrate the present invention Technical solution, rather than its limitations, scope of protection of the present invention is not limited thereto, although with reference to the foregoing embodiments to this hair It is bright to be described in detail, those skilled in the art should understand that: anyone skilled in the art In the technical scope disclosed by the present invention, it can still modify to technical solution documented by previous embodiment or can be light It is readily conceivable that variation or equivalent replacement of some of the technical features；And these modifications, variation or replacement, do not make The essence of corresponding technical solution is detached from the spirit and scope of technical solution of the embodiment of the present invention, should all cover in protection of the invention Within the scope of.Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. a kind of data optimization methods towards Thread-Level Parallelism in shared drive coenocytism, characterized by comprising:

For circulation to be optimized, detecting and determining that its all circulation carries to rely on whether there is the advantageous implementations of privatization；And according to Determined according to detection as a result, implementing data privatization processing.

2. the data optimization methods according to claim 1 towards Thread-Level Parallelism in shared drive coenocytism, special Sign is that can the advantageous implementations of privatization refer to that all circulations carry and rely on and disappear by data privatization It removes.

3. the data optimization methods according to claim 1 towards Thread-Level Parallelism in shared drive coenocytism, special Sign is, detects and determines that all circulations carry the advantageous implementations of privatization relied on, includes following content:

For the target circulation layer of data to be optimized, to wherein each scalar, if each reference in the target circulation layer Assignment is carried out by current iteration, then determines to rely on privatization processing caused by the scalar, otherwise, it is determined that the scalar is not necessarily to data Privatization processing；To wherein each array, if it is to be carried by the circulation layer inside target circulation layer, and detect that the array, which relies on, Each reference of each array element in target circulation stacking generation is by current iteration assignment, then caused by determining the array Privatization processing is relied on, otherwise, it is determined that the array is not necessarily to data privatization.

4. the data optimization methods according to claim 3 towards Thread-Level Parallelism in shared drive coenocytism, special Sign is, according to detection judgement as a result, carrying out relying on privatization processing caused by scalar, includes following content:

For dependence caused by the scalar handled to privatization, first every sentence in traversal loop body, searches and collect wherein For the write operation of the scalar, and add it in set W, if set W is sky, without carrying out scalar privatization, otherwise, It traverses set W and detects whether it is specification variable, if so, specification variable is become for each scalar being related in set It is changed to induction variable；Set W is traversed, privatization processing is carried out to each scalar being related in set.

5. the data optimization methods according to claim 1 or 4 towards Thread-Level Parallelism in shared drive coenocytism, It is characterized in that, array or privatization clause modification is extended using scalar to carry out privatization processing to scalar.

6. the data optimization methods according to claim 3 towards Thread-Level Parallelism in shared drive coenocytism, special Sign is, according to detection judgement as a result, carrying out relying on privatization processing caused by array, includes following content:

For dependence caused by the array handled to privatization, array of traversal loop weight every first accesses sentence, searches And the write operation for being wherein directed to array is collected, it adds it in set AW；If set AW is sky, privately owned without carrying out array Change, otherwise, traverse set AW, for each array reference being related in set, detects it with the presence or absence of specification, and if it exists, It is then to conclude by reduction transformation；Set AW is traversed, privatization processing is carried out to each array being related in set.

7. the data optimization methods according to claim 1 or 6 towards Thread-Level Parallelism in shared drive coenocytism, It is characterized in that, privatization processing is carried out to array using array dimension extension or privatization clause modification.

8. a kind of data-optimized device towards Thread-Level Parallelism in shared drive coenocytism, which is characterized in that include test Module and optimization module, wherein

Test module, for being directed to data to be optimized, detecting and determining that its all circulation carries dependence and has with the presence or absence of privatization Sharp implementations determine that the situation for needing privatization to handle is fed back to optimization module by result according to detection；

9. the data-optimized device according to claim 8 towards Thread-Level Parallelism in shared drive coenocytism, special Sign is that test module includes detection sub-module one and detection sub-module two, wherein

Detection sub-module one, for being directed to the target circulation layer of data to be optimized, to wherein each scalar, if being followed in the target Each reference in circular layer carries out assignment by current iteration, then determines to rely on privatization processing caused by the scalar, otherwise, sentence The fixed scalar is handled without data privatization；

Detection sub-module two, for being directed to the target circulation layer of data to be optimized, to wherein each array, if the array relies on It is to be carried by the circulation layer inside target circulation layer, and detect each array element drawing in target circulation stacking generation every time With by current iteration assignment, then determining that relying on privatization caused by the array is handled, otherwise, it is determined that the array is private without data Having.

10. the data-optimized device according to claim 8 towards Thread-Level Parallelism in shared drive coenocytism, special Sign is that optimization module includes optimization submodule one and optimization submodule two, wherein

Optimize submodule one, for according to feedback result, for dependence caused by the scalar handled to privatization, traversal first to be followed Every sentence in ring body is searched and collects the write operation for being wherein directed to the scalar, and adds it in set W, if set W is Otherwise sky, traverses set W for each scalar being related in set and whether detects it then without carrying out scalar privatization For specification variable, if so, being induction variable by specification change of variable；Set W is traversed, to each scalar being related in set Carry out privatization processing；

Optimize submodule two, for according to feedback result, for dependence caused by the array handled to privatization, traversal first to be followed Ring body every array of weight accesses sentence, searches and collects the write operation for being wherein directed to array, add it in set AW；If Set AW is sky, then without carrying out array privatization, otherwise, traverses set AW, draw for each array being related in set With detecting it whether there is specification, and if it exists, be then to conclude by reduction transformation；Set AW is traversed, it is every to what is be related in set A array carries out privatization processing.