CN101944014B - Method for realizing automatic pipeline parallelism - Google Patents
Method for realizing automatic pipeline parallelism
- Publication number: CN101944014B
- Application number: CN 201010281797
- Authority: CN (China)
- Prior art keywords: parallel, thread, pipeline, dependence, loop structure
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Devices For Executing Special Programs (AREA)
Abstract
The invention belongs to the technical field of program compilation and specifically relates to a method for realizing automatic pipeline parallelism. The method mainly comprises the following steps: (1) identification of pipeline parallelism, namely identifying loop structures that have cross-iteration dependences whose dependence distance vector is constant; (2) synchronization among threads, namely inserting synchronization according to the dependence distance vector and deleting redundant synchronization with the same distance vector; and (3) thread scheduling with a static step size, namely defining a custom thread scheduling strategy that balances the workload of each thread and reduces communication overhead. Identification of the loop structure type relies on conventional array data-flow analysis and dependence tests, and pipeline parallelism handles only regular loop structures with backward cross-iteration dependences. Because the synchronization overhead of pipeline parallelism is high, it is applied only to the outermost loop of a loop nest. The benefit of pipeline parallelism depends on the program: the larger the number of loop iterations and the longer the dependence distance, the greater the performance improvement. The method extends the capability of automatic parallel optimization and helps further improve the performance of scientific computing programs.
Description
Technical field
The invention belongs to the technical field of program compilation and specifically relates to a method for realizing automatic pipeline parallelism.
Background art
Modern computer architecture is moving toward multi-core, multi-threaded designs, which poses new challenges for automatic program parallelization. The relative lag of parallelizing compilation technology has, to some extent, restricted the efficient use and wider adoption of high-performance computers. Automatically compiling serial programs into parallel programs is therefore of great significance for exploiting the advantages of multi-core hardware, improving program performance, and promoting the development of parallel computing systems.
Loops are the program structure richest in parallelism, and they are often the most time-consuming part of program execution. Automatic parallel optimization of a serial program is, in essence, parallelism analysis of loop bodies. Loop parallelism falls into two kinds: full parallelism and pipeline parallelism. Full parallelism applies to loop structures with no data dependence between loop iterations; this is also the loop structure permitted by the OpenMP programming model. Loops that do have dependences between iterations are divided into those with forward dependences and those with backward dependences. In a forward dependence, i.e. a flow dependence, an array read in the current iteration depends on a write in an earlier iteration; such loops can be made fully parallel through loop transformation. In a backward dependence, an array read in the current iteration must occur before the array write of a subsequent iteration. For a backward dependence whose dependence distance vector is constant, parallel execution can be achieved through pipeline parallelism. Pipeline parallelism requires data synchronization between iterations and scheduling of the parallel threads in order to form a pipeline among the threads, so it is considerably harder to handle than full parallelism. Parallelization research in China has focused mostly on fully parallel loop structures.
The present invention proposes a method for realizing automatic pipeline parallelism. It uses spin locks to synchronize data among threads and a custom loop scheduling strategy to apply pipeline parallelism to loop structures that contain cross-iteration data dependences with a constant dependence vector. The invention not only increases the kinds of loops that automatic parallel optimization can handle, but also breaks a foreign technical monopoly, and is significant for domestic high-performance computing.
Summary of the invention
The object of the present invention is to provide a method for realizing automatic pipeline parallelism that balances load and optimizes performance.
The method for realizing automatic pipeline parallelism proposed by the present invention comprises the following steps: (1) identification of pipeline parallelism; (2) synchronization among threads; (3) thread scheduling with a static step size. Specifically:
Identification of pipeline parallelism: based on the features of the loop structure, determine whether pipeline parallelism can be applied to the loop, that is, whether the loop has a backward dependence between loop iterations with a constant dependence distance vector, in which case pipeline parallelism can be applied; then use a parallel overhead model to assess whether applying this pipeline parallelism is worthwhile;
Synchronization among threads: compute the synchronization points of the pipeline from the length of the dependence distance vector, and thereby synchronize the threads;
Thread scheduling with a static step size: the invention defines its own thread scheduling strategy to keep the workload of each thread balanced.
In the present invention, automatic pipeline parallelism automatically parallelizes loop structures with constant backward cross-iteration dependences, which extends the capability of automatic parallel optimization.
In the present invention, spin locks are used to set up and drain the multi-threaded pipeline, and synchronization among threads is derived from the dependence distance vector.
In the present invention, the custom static step-size scheduling strategy reduces communication overhead and balances the load among threads.
The present invention supplements full parallelism in the automatic parallel optimization process: when a loop structure cannot be fully parallelized, it is then checked for pipeline parallelism. Pipeline parallelism requires synchronization among threads, so its parallel overhead is larger than that of full parallelism. The larger the number of loop iterations, or the longer the data dependence distance between loop iterations, the greater the benefit of pipeline parallelism, to the point where the synchronization overhead becomes negligible. When the backward dependence distance between loop iterations is not constant, the loop is executed serially; the present invention cannot handle that case.
Description of drawings
Fig. 1 shows the position of automatic pipeline parallelism within the overall automatic parallel optimization.
Fig. 2 is an example of pipeline-parallel execution.
Embodiment
The concrete operation steps of the present invention are further described below.
Fig. 1 shows the overall automatic parallelization flow of the present invention.
First, determine the dependence type of the loop structure
Automatic parallel optimization first transforms the source program into an intermediate representation. The intermediate representation abstracts the source program in a structured form and records the information collected and produced during program analysis and optimization; it provides the program information needed by each stage of analysis, transformation, and optimization. Based on the intermediate representation, the parallel type of each loop structure is identified, and the loop is marked as fully parallel, pipeline parallel, or serial.
1. Using traditional data dependence analysis and dependence tests, loops that have no data dependence between iterations and whose loop structure is reasonably regular are marked as fully parallel. For loops with dependences between iterations, the sign of the dependence distance vector determines whether the dependence is forward or backward.
Let Sk(I), where I = (i1, i2, ..., ik) is the loop induction vector, denote the instance of statement Sk in the particular loop iteration (I1 = i1, I2 = i2, ..., Ik = ik). The following definitions then apply:
Definition 1: If the following dependence occurs: statement Sq(J) waits for Sp(I) to finish, then Sp(I) is called the entry of the dependence and Sq(J) its exit. If the data dependence crosses loop iterations, Sp(I) and Sq(J) are in different loop iterations.
Definition 2: If the entry of a dependence is Sp(I) and its exit is Sq(J), then the distance vector of the dependence is J - I = (j1 - i1, j2 - i2, ..., jc - ic), where c is the innermost loop containing both Sp and Sq.
If the distance vector is constant and negative, the dependence is a forward dependence; if it is constant and positive, it is a backward dependence. If the dependence distance vector is not constant, the dependence is an irregular loop dependence, and the loop is executed serially.
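The definitions and the classification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`distance_vector`, `constant_distance`, `classify`) are hypothetical, the dependence pairs are supplied by hand rather than produced by a real dependence test, and the sign of a multi-component vector is assumed to be the sign of its first non-zero component, which the text does not spell out.

```python
from typing import Iterable, Optional, Sequence, Tuple

Vector = Tuple[int, ...]
Pair = Tuple[Sequence[int], Sequence[int]]


def distance_vector(entry: Sequence[int], exit_: Sequence[int]) -> Vector:
    """Return J - I for a dependence with entry Sp(I) and exit Sq(J)."""
    if len(entry) != len(exit_):
        raise ValueError("iteration vectors must have the same length")
    return tuple(j - i for i, j in zip(entry, exit_))


def constant_distance(pairs: Iterable[Pair]) -> Optional[Vector]:
    """Return the distance vector shared by every (I, J) pair, or None if they differ."""
    distances = {distance_vector(i, j) for i, j in pairs}
    return next(iter(distances)) if len(distances) == 1 else None


def classify(pairs: Iterable[Pair]) -> str:
    """Label a loop from its observed dependence pairs, following the rule in the text."""
    pairs = list(pairs)
    if not pairs:
        return "fully parallel"        # no dependence between iterations
    d = constant_distance(pairs)
    if d is None:
        return "serial"                # irregular (non-constant) dependence
    # Assumption: the sign of a vector is the sign of its first non-zero component.
    lead = next((x for x in d if x != 0), 0)
    if lead > 0:
        return "pipeline parallel"     # constant and positive: backward dependence
    if lead < 0:
        return "forward dependence"    # constant and negative: left to loop transformation
    return "fully parallel"            # zero distance: nothing crosses iterations


print(classify([((1,), (3,)), ((2,), (4,))]))
```

In the final call, both pairs have distance (2,), so the loop is labelled as a pipeline-parallel candidate.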
For forward dependences, loop transformation techniques are applied, such as loop distribution, loop fusion, and unimodular transformation. Loops whose dependence vector is constant and whose dependence is backward are marked as pipeline parallel. The present invention is concerned only with loop structures that can be pipeline parallelized.
2. A parallel overhead model is used to evaluate the performance of pipeline parallelism. If there is a performance gain, pipeline parallelism is adopted; otherwise the pipeline-parallel mark on the loop structure is removed. Because the synchronization overhead of pipeline parallelism on the inner loops of a loop nest is large, only the pipeline parallelism of the outermost loop is kept, and inner loops marked as pipeline parallel are executed serially.
Second, insert synchronization according to the dependence vector
In the pipeline setup stage, the range of the spin locks is determined and the locks are initialized according to the number of parallel threads. The spin locks ensure that the threads enter the pipeline in order. In the pipeline drain stage, spin locks are likewise used to drain the pipeline as needed. The spin lock is implemented by waiting in a conditional loop. Synchronization among threads is determined by the dependence distance vector and is implemented by calling a system library function, such as the Fortran library function synchronize(). Redundant synchronization for identical dependence distances can be deleted to reduce synchronization overhead. Fig. 2 is a schematic diagram of pipeline parallelism.
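As a rough illustration of distance-based synchronization, the sketch below runs the loop `a[i] = a[i - distance] + 1` on several Python threads. It is not the patent's implementation: the per-iteration `done` flags and the busy-wait loop stand in for the spin locks, the function name `pipeline_run` is hypothetical, each thread is assumed to own one contiguous block of iterations, and no Fortran library call is involved.

```python
import threading
import time
from typing import List


def pipeline_run(n: int, distance: int, num_threads: int) -> List[int]:
    """Compute a[i] = a[i - distance] + 1 for i in [distance, n) with pipelined threads."""
    a = [0] * n
    done = [False] * n                 # done[i] becomes True once iteration i has finished
    for i in range(min(distance, n)):
        done[i] = True                 # the first `distance` elements have no predecessor
    block = -(-n // num_threads)       # ceiling division: one contiguous block per thread

    def worker(tid: int) -> None:
        first = max(tid * block, distance)
        last = min((tid + 1) * block, n)
        for i in range(first, last):
            # Synchronization point derived from the dependence distance:
            # iteration i may start only after iteration i - distance has finished.
            while not done[i - distance]:
                time.sleep(0)          # conditional-loop wait; yield to other threads
            a[i] = a[i - distance] + 1
            done[i] = True             # publish completion to any waiting thread

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


print(pipeline_run(12, 4, 3))  # → [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

With 12 iterations, 3 threads, and a distance of 4, the block size equals the dependence distance, so the k-th iteration of each thread waits only on the k-th iteration of the previous thread and the threads overlap like pipeline stages. With a distance of 1, the same code still produces the right result but degenerates to almost serial execution, which is consistent with the remark that longer dependence distances give larger gains.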
Third, insert a custom thread scheduling function before the parallel loop structure
The OpenMP 3.0 programming specification allows user-defined thread scheduling algorithms. To balance load, realize the benefit of parallelism, and reduce communication overhead, the present invention adopts a static step-size scheduling strategy. Let L and U be the lower and upper bounds of the loop, and let the number of threads T satisfy T < U - L + 1; then the block size assigned to each thread is (U - L + 1)/T rounded up.
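The block-size rule can be checked with a short Python sketch. The function name `static_blocks` is hypothetical, the loop bounds are taken to be inclusive, and blocks are assumed to be handed out in thread order; the text only fixes the block size itself.

```python
from typing import List


def static_blocks(lower: int, upper: int, num_threads: int) -> List[range]:
    """Split the inclusive iteration range [lower, upper] into blocks of ceil((U-L+1)/T)."""
    count = upper - lower + 1
    if not 0 < num_threads < count:
        raise ValueError("requires 0 < T < U - L + 1")
    block = -(-count // num_threads)   # ceiling of (U - L + 1) / T using integers only
    blocks = []
    for tid in range(num_threads):
        start = lower + tid * block
        stop = min(start + block, upper + 1)
        blocks.append(range(start, stop))  # empty if earlier blocks already cover the loop
    return blocks


for tid, iterations in enumerate(static_blocks(1, 10, 4)):
    print(tid, list(iterations))
```

For L = 1, U = 10, and T = 4 the block size is 3, so the threads receive iterations 1 to 3, 4 to 6, 7 to 9, and 10. Rounding up means the last thread may get a shorter block, or no iterations at all when the earlier blocks already cover the loop.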
Fourth, translate into a parallel program with OpenMP annotations
Automatic parallel optimization generally uses source-to-source program transformation: after analysis and optimization, the source program is translated into a source program carrying OpenMP parallel directives. Following the OpenMP specification, the intermediate marks are converted into a parallel source program, and the shared variables, privatized variables, and so on in the pipeline-parallel loop structure are identified.
Claims (1)
1. A method for realizing automatic pipeline parallelism, characterized in that the concrete steps are: (1) identification of pipeline parallelism; (2) synchronization among threads; (3) thread scheduling with a static step size; wherein:
Identification of pipeline parallelism: based on the features of the loop structure, determine whether pipeline parallelism can be applied to the loop, that is, whether the loop has a backward dependence between loop iterations with a constant dependence distance vector, in which case pipeline parallelism can be applied; then use a parallel overhead model to assess whether applying this pipeline parallelism is worthwhile;
Synchronization among threads: compute the synchronization points of the pipeline from the length of the dependence distance vector, and thereby synchronize the threads;
Thread scheduling with a static step size: a custom thread scheduling strategy keeps the workload of each thread balanced;
The concrete operation steps are:
First, determine the dependence type of the loop structure:
The source program is first transformed into an intermediate representation. The intermediate representation abstracts the source program in a structured form and records the information collected and produced during program analysis and optimization, providing the program information needed by each stage of analysis, transformation, and optimization; based on the intermediate representation, the parallel type of each loop structure is identified and the loop is marked as fully parallel, pipeline parallel, or serial, in the following steps:
1) Using existing data dependence analysis and dependence tests, loops that have no data dependence between iterations and whose loop structure is reasonably regular are marked as fully parallel; for loops with dependences between iterations, the sign of the dependence distance vector determines whether the dependence is forward or backward: if the distance vector is constant and negative, it is a forward dependence; if the distance vector is constant and positive, it is a backward dependence; if the dependence distance vector is not constant, it is an irregular loop dependence and the loop is executed serially;
2) A parallel overhead model is used to evaluate the performance of pipeline parallelism; if there is a performance gain, pipeline parallelism is adopted, otherwise the pipeline-parallel mark on the loop structure is removed; pipeline parallelism is applied only to the outermost loop of a loop nest;
Second, insert synchronization according to the dependence vector:
In the pipeline setup stage, the range of the spin locks is determined and the locks are initialized according to the number of parallel threads; the spin locks ensure that the threads enter the pipeline in order; in the pipeline drain stage, spin locks are used to drain the pipeline as needed; the spin lock is implemented by waiting in a conditional loop; synchronization among threads is determined by the dependence distance vector and implemented by calling a system library function, and redundant synchronization for identical dependence distances is deleted;
Third, insert a custom thread scheduling function before the parallel loop structure:
A static step-size scheduling strategy is adopted: let L and U be the lower and upper bounds of the loop, and let the number of threads T satisfy T < U - L + 1; then the block size assigned to each thread is (U - L + 1)/T rounded up;
Fourth, translate into a parallel program with OpenMP annotations:
Source-to-source program transformation is used: after analysis and optimization, the source program is translated into a source program carrying OpenMP parallel directives.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010281797 CN101944014B (en) | 2010-09-15 | 2010-09-15 | Method for realizing automatic pipeline parallelism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010281797 CN101944014B (en) | 2010-09-15 | 2010-09-15 | Method for realizing automatic pipeline parallelism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101944014A (en) | 2011-01-12 |
CN101944014B (en) | 2013-08-21 |
Family
ID=43436015
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010281797 (Expired - Fee Related), CN101944014B (en) | 2010-09-15 | 2010-09-15 | Method for realizing automatic pipeline parallelism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101944014B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102200924B (en) * | 2011-05-17 | 2014-07-16 | 北京北大众志微系统科技有限责任公司 | Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling |
CN102508776B (en) * | 2011-11-03 | 2014-09-17 | 中国人民解放军国防科学技术大学 | Automatic construction method for evaluation stimulus of multi-thread cross double-precision short-vector structure |
CN103246541B (en) * | 2013-04-27 | 2016-03-23 | 中国人民解放军信息工程大学 | A kind of automatically parallelizing multistage parallel cost evaluation method |
CN104298600A (en) * | 2014-10-23 | 2015-01-21 | 广州华多网络科技有限公司 | Software testing method and device |
CN105302624B (en) * | 2015-09-17 | 2018-10-26 | 哈尔滨工程大学 | Start spacing automatic analysis method between cycle flowing water iteration in a kind of reconfigurable compiling device |
CN106445666B (en) * | 2016-09-26 | 2019-10-11 | 西安交通大学 | A kind of parallel optimization method of DOACROSS circulation |
CN109522126B (en) * | 2018-11-19 | 2020-04-24 | 中国人民解放军战略支援部队信息工程大学 | Thread-level parallel data optimization method and device in shared memory multi-core structure |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1458586A (en) * | 2003-06-07 | 2003-11-26 | 顾士平 | Method for realizing next generation of high performance computer |
CN1877532A (en) * | 2005-06-06 | 2006-12-13 | 松下电器产业株式会社 | Compiler apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN101944014A (en) | 2011-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101944014B (en) | Method for realizing automatic pipeline parallelism | |
Rhu et al. | vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | |
CN112306678B (en) | Method and system for parallel processing of algorithms based on heterogeneous many-core processor | |
US8752036B2 (en) | Throughput-aware software pipelining for highly multi-threaded systems | |
CN103049245B (en) | A kind of software performance optimization method based on central processor CPU multi-core platform | |
CN101963918B (en) | Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform | |
CN101807144B (en) | Prospective multi-threaded parallel execution optimization method | |
Xiao et al. | A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach | |
CN102981807B (en) | Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment | |
Kim et al. | Efficient SIMD code generation for irregular kernels | |
US20170083319A1 (en) | Generation and use of block branch metadata | |
CN107247628B (en) | Data flow program task dividing and scheduling method for multi-core system | |
CN101833438A (en) | General data processing method based on multiple parallel | |
Samadi et al. | Paragon: Collaborative speculative loop execution on gpu and cpu | |
EP3350689A1 (en) | Multi-nullification | |
CN102799418B (en) | Processor architecture and instruction execution method integrating sequence and VLIW (Very Long Instruction Word) | |
Giovannini et al. | A hybrid parallelization strategy of a cfd code for turbomachinery applications | |
CN109062636A (en) | A kind of data processing method, device, equipment and medium | |
Anantpur et al. | Runtime dependence computation and execution of loops on heterogeneous systems | |
CN103207786B (en) | Gradual intelligent backtracking vector code tuning method | |
CN101655783A (en) | Forward-looking multithreading partitioning method | |
Zheng et al. | Performance model for OpenMP parallelized loops | |
Popov et al. | Piecewise holistic autotuning of compiler and runtime parameters | |
CN103091708B (en) | A kind of 3-D seismics tectonic erosion periods performance optimization method | |
CN102981805B (en) | The response method of serialized software and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130821; Termination date: 20160915 |