CN101944014B - Method for realizing automatic pipeline parallelism - Google Patents

Publication number
CN101944014B
Authority
CN
China
Prior art keywords
parallel
thread
pipeline
dependence
loop structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010281797
Other languages
Chinese (zh)
Other versions
CN101944014A (en)
Inventor
杨克峤
李弋
臧斌宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN 201010281797
Publication of CN101944014A
Application granted
Publication of CN101944014B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention belongs to the technical field of program compilation and relates in particular to a method for realizing automatic pipeline parallelism. The method comprises three main steps: (1) identification of pipeline parallelism, namely identifying loop structures that carry a cross-iteration dependence whose dependence distance vector is constant; (2) inter-thread synchronization, namely inserting synchronization according to the dependence distance vector and deleting redundant synchronization with the same distance vector; and (3) static step-length thread scheduling, namely a user-defined thread scheduling strategy that balances the workload of each thread and reduces communication overhead. Identification of the loop type relies on conventional array data-flow analysis and dependence tests, and pipeline parallelism handles only regular loop structures with backward cross-iteration dependences. Because the synchronization overhead of pipeline parallelism is high, it is applied only to the outermost level of a nested loop. The benefit of pipeline parallelism depends on the program: the larger the number of loop iterations and the longer the dependence distance, the greater the performance gain. The method extends the capability of automatic parallel optimization and helps further improve the performance of scientific computing programs.

Description

A method for realizing automatic pipeline parallelism
Technical field
The invention belongs to the technical field of program compilation and relates in particular to a method for realizing automatic pipeline parallelism.
Background
Modern computer architectures are developing toward multi-core, multi-threaded designs, which poses new challenges for automatic program parallelization. The relative lag of parallelizing compilation technology has, to some extent, restricted the efficient use and wider adoption of high-performance computers. Automatically compiling serial programs into parallel programs is therefore of great significance for exploiting multi-core hardware, improving program performance, and promoting the development of parallel computing systems.
Loops are the program structure richest in parallelism, and they are often the most time-consuming part of program execution. Automatic parallel optimization of serial programs is, in essence, parallelism analysis of loop bodies. Loop parallelism can be divided into two kinds: full parallelism and pipeline parallelism. Full parallelism applies to loop structures with no data dependence between loop iterations; this is also the loop structure permitted by the OpenMP programming model. Dependences between loop iterations are divided into forward dependences and backward dependences. In a forward dependence, i.e. a flow dependence, an array read in the current iteration depends on a write in a previous iteration; such loops can be made fully parallel through loop transformations. In a backward dependence, an array read in the current iteration must take place before an array write in a subsequent iteration. For a backward dependence whose dependence distance vector is constant, parallel execution can be achieved through pipeline parallelism. Pipeline parallelism requires data synchronization between iterations and scheduling of the parallel threads to form a pipeline across threads, so it is considerably harder to handle than full parallelism. Domestic parallelization research has focused mostly on fully parallel loop structures.
The present invention proposes a method for realizing automatic pipeline parallelism. It uses spin locks to synchronize data between threads and a user-defined loop scheduling strategy to apply pipeline parallelism to loop structures that contain cross-iteration data dependences with constant dependence vectors. The invention not only increases the kinds of loops that automatic parallel optimization can handle, but also breaks the foreign technology monopoly, and is significant for domestic high-performance computing.
Summary of the invention
The object of the present invention is to provide a method for realizing automatic pipeline parallelism that balances load and optimizes performance.
The method proposed by the present invention comprises the following steps: (1) identification of pipeline parallelism; (2) inter-thread synchronization; (3) static step-length thread scheduling. Specifically:
Identification of pipeline parallelism judges, from the characteristics of the loop structure, whether the loop can be pipelined: a loop structure whose cross-iteration dependence is a backward dependence with a constant dependence distance vector can be executed with pipeline parallelism. A parallel overhead model then assesses whether this pipeline parallelism is worth applying.
Inter-thread synchronization computes the synchronization points of the pipeline from the length of the dependence distance vector, and thereby synchronizes the threads.
Static step-length thread scheduling uses a thread scheduling strategy defined by the present invention to ensure that the workload of each thread is balanced.
In the present invention, automatic pipeline parallelism automatically parallelizes loop structures with constant backward cross-iteration dependences, which extends the capability of automatic parallel optimization.
In the present invention, spin locks are used to set up and drain the multi-thread pipeline, and the dependence distance vector is used to synchronize the threads.
In the present invention, the user-defined static step-length scheduling strategy reduces communication overhead and balances the load across threads.
The present invention supplements full parallelism within the automatic parallel optimization process: when a loop structure cannot be fully parallelized, it is then checked for pipeline parallelism. Pipeline parallelism requires inter-thread synchronization, so its parallel overhead is higher than that of full parallelism. The larger the number of loop iterations, or the longer the data dependence distance between iterations, the greater the benefit of pipeline parallelism, to the point where the synchronization overhead can be ignored. When the backward dependence distance between loop iterations is not constant, the loop is executed serially; the present invention cannot handle this case.
Description of drawings
Fig. 1 shows the position of automatic pipeline parallelism within the overall automatic parallel optimization.
Fig. 2 is an example of pipeline-parallel execution.
Embodiment
The concrete operating steps of the present invention are further described below.
Fig. 1 shows the overall automatic parallelization flow of the present invention.
First, determine the dependence type of the loop structure
Automatic parallel optimization first converts the source program into an intermediate representation. The intermediate representation abstracts the source program in structured form and records the information collected and produced during program analysis and optimization; it provides the program information needed by each stage of program analysis, transformation, and optimization. Based on the intermediate representation, the parallel type of each loop structure is identified, and the loop is labeled as fully parallel, pipeline parallel, or serial.
1. Using traditional data dependence analysis and dependence tests, a loop that has no data dependence between iterations and whose structure is reasonably regular is marked as fully parallel. For a loop with dependences between iterations, the sign of the dependence distance vector determines whether the dependence is a forward dependence or a backward dependence.
Let Sk(I), where I = (i1, i2, ..., ik) is the loop induction vector, denote the instance of statement Sk in the particular loop iteration (I1 = i1, I2 = i2, ..., Ik = ik). The following definitions then apply:
Definition 1: If statement Sq(J) waits for the completion of Sp(I), then the entry of this dependence is Sp(I) and its exit is Sq(J). If the data dependence crosses loop iterations, Sp(I) and Sq(J) are in different loop iterations.
Definition 2: If the entry of a dependence is Sp(I) and its exit is Sq(J), then the distance vector of the dependence is J-I = (j1-i1, j2-i2, ..., jc-ic), where c is the innermost loop common to Sp and Sq.
If the distance vector is constant and negative, the dependence is a forward dependence; if it is constant and positive, it is a backward dependence. If the dependence distance vector is not constant, the loop has an irregular dependence and is executed serially.
Forward dependences are handled with loop transformation techniques such as loop distribution, loop fusion, and unimodular transformations. A loop whose dependence vector is constant and whose dependence is backward is marked as pipeline parallel. The present invention is concerned only with loop structures that can be pipelined.
2. A parallel overhead model is used to evaluate the performance of the pipeline parallelism. If there is a performance benefit, pipeline parallelism is adopted; otherwise the pipeline-parallel label is deleted from the loop structure. Because the synchronization overhead of pipelining the inner loops of a nest is high, only the pipeline parallelism of the outermost loop is kept, and pipeline-parallel inner loops are executed serially.
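The classification rule of step 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the string labels, and the input format (a list of entry/exit iteration pairs for one dependence) are hypothetical, and the "sign" of a distance vector is assumed here to be the sign of its first non-zero component, which the text does not specify.

```python
from typing import Sequence, Tuple

Iteration = Tuple[int, ...]


def distance_vector(entry: Iteration, exit_: Iteration) -> Iteration:
    """Definition 2: distance J - I for entry Sp(I) and exit Sq(J)."""
    return tuple(j - i for i, j in zip(entry, exit_))


def classify_dependence(instances: Sequence[Tuple[Iteration, Iteration]]) -> str:
    """Label a loop from the observed (entry, exit) instances of one dependence."""
    if not instances:
        return "fully parallel"            # no cross-iteration dependence
    distances = {distance_vector(i, j) for i, j in instances}
    if len(distances) != 1:
        return "serial"                    # distance vector is not constant
    (d,) = distances
    # Assumption: the sign of the vector is the sign of its first non-zero entry.
    lead = next((c for c in d if c != 0), 0)
    if lead > 0:
        return "pipeline parallel"         # constant, positive: backward dependence
    if lead < 0:
        return "loop transformation"       # constant, negative: forward dependence
    return "fully parallel"                # zero distance: same iteration


print(classify_dependence([((1,), (2,)), ((2,), (3,))]))  # pipeline parallel
print(classify_dependence([((1,), (2,)), ((2,), (4,))]))  # serial
```

The overhead-model check of step 2 is not sketched because the text gives no formula for it.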
Second, insert synchronization according to the dependence vector
In the pipeline setup phase, the range of the spin locks is established and the locks are initialized according to the number of parallel threads. The spin locks guarantee that the threads enter the pipeline in order. In the pipeline drain phase, spin locks are likewise used, as needed, to drain the pipeline. A spin lock is implemented as a conditional busy-wait loop. Inter-thread synchronization is determined by the dependence distance vector and is implemented by calling a system built-in function, such as the Fortran built-in function synchronize(). Redundant synchronization with an identical dependence distance can be deleted to reduce synchronization overhead. Fig. 2 is a schematic diagram of pipeline parallelism.
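As a rough illustration of distance-based synchronization with a conditional busy-wait, the following Python sketch runs a one-dimensional loop whose iteration i reads elements written by iterations i - d. Everything here is an assumption made for the example: the loop body, the names `unique_sync_distances` and `run_pipeline`, the per-iteration `done` flags standing in for spin locks, and the contiguous block assignment. It demonstrates that the ordering is preserved, not that a speedup is obtained.

```python
import threading
import time
from typing import List


def unique_sync_distances(read_distances: List[int]) -> List[int]:
    """Keep one synchronization per distinct dependence distance."""
    if any(d < 1 for d in read_distances):
        raise ValueError("this sketch handles positive distances only")
    return sorted(set(read_distances))


def run_pipeline(n: int, read_distances: List[int], num_threads: int) -> List[int]:
    """Run a[i] = 1 + sum(a[i - d] for each read distance d) across threads."""
    if num_threads < 1:
        raise ValueError("need at least one thread")
    waits = unique_sync_distances(read_distances)  # redundant waits removed
    a = [0] * n
    done = [False] * n              # done[i] is set once iteration i has finished
    chunk = -(-n // num_threads)    # ceiling division: one contiguous block each

    def worker(tid: int) -> None:
        lo = tid * chunk
        for i in range(lo, min(lo + chunk, n)):
            for d in waits:
                # Conditional busy-wait: spin until the producer iteration is done.
                while i - d >= 0 and not done[i - d]:
                    time.sleep(0)   # yield so other threads can make progress
            a[i] = 1 + sum(a[i - d] for d in read_distances if i - d >= 0)
            done[i] = True          # signal consumers of iteration i

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


print(run_pipeline(6, [1], 2))  # [1, 2, 3, 4, 5, 6]
```

With read distances [1, 1, 2], the two reads at distance 1 share a single wait, which mirrors the deletion of redundant synchronization with an identical dependence distance.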
Third, insert the user-defined thread scheduling function before the parallel loop structure
The OpenMP 3.0 programming standard allows user-defined thread scheduling algorithms. To balance load, exploit parallelism effectively, and reduce communication overhead, the present invention adopts a static step-length scheduling strategy. Let L and U be the lower and upper bounds of the loop, and let the number of threads T satisfy T < U-L+1; the block size assigned to each thread is then the ceiling of (U-L+1)/T.
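The block-size rule can be checked with a short Python sketch. The function name `static_chunks` and the choice to return Python `range` objects are assumptions made for illustration; only the formula, the ceiling of (U-L+1)/T with T < U-L+1, comes from the text.

```python
from typing import List


def static_chunks(lower: int, upper: int, num_threads: int) -> List[range]:
    """Split iterations lower..upper (inclusive) into contiguous per-thread blocks."""
    total = upper - lower + 1
    if not 0 < num_threads < total:
        raise ValueError("requires 0 < T < U - L + 1")
    chunk = -(-total // num_threads)          # ceiling of (U - L + 1) / T
    blocks = []
    for t in range(num_threads):
        start = lower + t * chunk
        stop = min(start + chunk, upper + 1)  # clip the last block at U
        blocks.append(range(start, stop))     # empty if start is past U
    return blocks


print([list(r) for r in static_chunks(1, 10, 4)])
# [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```

Because the block size is rounded up, the last threads may receive fewer iterations, or none: with L = 1, U = 5, and T = 4 the blocks are [1, 2], [3, 4], [5], and an empty block.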
Fourth, translate into a parallel program with OpenMP annotations
Automatic parallel optimization generally uses source-to-source program transformation: after analysis and optimization, the source program is translated into a source program carrying OpenMP parallel directives. Following the OpenMP standard, the intermediate representation is converted into parallel source code, and the shared variables, privatized variables, and so on in the pipeline-parallel loop structure are identified.

Claims (1)

1. A method for realizing automatic pipeline parallelism, characterized in that the steps are: (1) identification of pipeline parallelism; (2) inter-thread synchronization; (3) static step-length thread scheduling; wherein:
Identification of pipeline parallelism judges, from the characteristics of the loop structure, whether the loop can be pipelined, namely a loop structure whose cross-iteration dependence is a backward dependence with a constant dependence distance vector can be executed with pipeline parallelism; and a parallel overhead model assesses whether this pipeline parallelism is worth applying;
Inter-thread synchronization computes the synchronization points of the pipeline from the length of the dependence distance vector, and thereby synchronizes the threads;
Static step-length thread scheduling is a user-defined thread scheduling strategy that ensures the workload of each thread is balanced;
The concrete operating steps are:
First, determine the dependence type of the loop structure:
The source program is first converted into an intermediate representation; the intermediate representation abstracts the source program in structured form and records the information collected and produced during program analysis and optimization, providing the program information needed by each stage of program analysis, transformation, and optimization; based on the intermediate representation, the parallel type of each loop structure is identified and the loop is labeled as fully parallel, pipeline parallel, or serial, by the following steps:
1) Using existing data dependence analysis and dependence tests, a loop that has no data dependence between iterations and whose structure is reasonably regular is marked as fully parallel; for a loop with dependences between iterations, the sign of the dependence distance vector determines whether the dependence is forward or backward: if the distance vector is constant and negative, the dependence is a forward dependence; if the distance vector is constant and positive, the dependence is a backward dependence; if the dependence distance vector is not constant, the loop has an irregular dependence and is executed serially;
2) A parallel overhead model is used to evaluate the performance of the pipeline parallelism; if there is a performance benefit, pipeline parallelism is adopted, otherwise the pipeline-parallel label is deleted from the loop structure; pipeline parallelism is applied only to the outermost loop of a nest;
Second, insert synchronization according to the dependence vector:
In the pipeline setup phase, the range of the spin locks is established and the locks are initialized according to the number of parallel threads; the spin locks guarantee that the threads enter the pipeline in order; in the pipeline drain phase, spin locks are used, as needed, to drain the pipeline; a spin lock is implemented as a conditional busy-wait loop; inter-thread synchronization is determined by the dependence distance vector and is implemented by calling a system built-in function, and redundant synchronization with an identical dependence distance is deleted;
Third, insert the user-defined thread scheduling function before the parallel loop structure:
A static step-length scheduling strategy is adopted: let L and U be the lower and upper bounds of the loop, and let the number of threads T satisfy T < U-L+1; the block size assigned to each thread is then the ceiling of (U-L+1)/T;
Fourth, translate into a parallel program with OpenMP annotations:
Source-to-source program transformation is used: after analysis and optimization, the source program is translated into a source program carrying OpenMP parallel directives.
CN 201010281797 2010-09-15 2010-09-15 Method for realizing automatic pipeline parallelism Expired - Fee Related CN101944014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010281797 CN101944014B (en) 2010-09-15 2010-09-15 Method for realizing automatic pipeline parallelism

Publications (2)

Publication Number Publication Date
CN101944014A CN101944014A (en) 2011-01-12
CN101944014B CN101944014B (en) 2013-08-21

Family

ID=43436015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010281797 Expired - Fee Related CN101944014B (en) 2010-09-15 2010-09-15 Method for realizing automatic pipeline parallelism

Country Status (1)

Country Link
CN (1) CN101944014B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200924B (en) * 2011-05-17 2014-07-16 北京北大众志微系统科技有限责任公司 Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling
CN102508776B (en) * 2011-11-03 2014-09-17 中国人民解放军国防科学技术大学 Automatic construction method for evaluation stimulus of multi-thread cross double-precision short-vector structure
CN103246541B (en) * 2013-04-27 2016-03-23 中国人民解放军信息工程大学 A kind of automatically parallelizing multistage parallel cost evaluation method
CN104298600A (en) * 2014-10-23 2015-01-21 广州华多网络科技有限公司 Software testing method and device
CN105302624B (en) * 2015-09-17 2018-10-26 哈尔滨工程大学 Start spacing automatic analysis method between cycle flowing water iteration in a kind of reconfigurable compiling device
CN106445666B (en) * 2016-09-26 2019-10-11 西安交通大学 A kind of parallel optimization method of DOACROSS circulation
CN109522126B (en) * 2018-11-19 2020-04-24 中国人民解放军战略支援部队信息工程大学 Thread-level parallel data optimization method and device in shared memory multi-core structure

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1458586A (en) * 2003-06-07 2003-11-26 顾士平 Method for realizing next generation of high performance computer
CN1877532A (en) * 2005-06-06 2006-12-13 松下电器产业株式会社 Compiler apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130821

Termination date: 20160915
